I. INTRODUCTION

The design and implementation of programming languages is a complex problem 
which must be addressed from at least four distinct viewpoints. These viewpoints reflect 
the different but interacting interests of the designer, implementer, user, and theoretician. 
We address specifically the kinds of problems evident in the following two scenarios: 

Scenario 1: The old dangling ELSE problem. 

An early ALGOL grammar in Backus-Naur Form (BNF) was ambiguous with
respect to nested IF-THEN-ELSE statements. This was noticed by implementers who
often adopted the fairly local solution of attaching an ELSE to the most recent
available THEN. Although BNF grammars were eventually discovered corresponding
to this resolution, the grammar for ALGOL was rewritten to simply forbid nested
conditionals [Naur 1963].

Scenario 2: A new, theoretically sound approach. 

This is a summary of advice given for the construction of deterministic parsers 
and translators in The Theory of Parsing, Translation and Compiling [Aho & Ullman
1972]. 

1) Write your grammar in BNF. 

2) Decide whether you want top-down or bottom-up parsing (top down is more 
flexible for translation). 

3a) If you choose top-down: apply known transformations to the grammar and 
check the result for the LL(1) property. If successful, a reliable top-down 
parser may automatically be constructed which handles a general class of 
syntax directed translation. 



3b) If you choose bottom-up: attempt to modify the grammar to satisfy the
SLR(1) or LALR(1) conditions. If successful, a bottom-up parser may be
likewise constructed.

4) In both cases, especially bottom-up, apply known optimizing transformations 
to the parsers to attain practical efficiency. 

In the first scenario BNF is being used as the formal reference tool, since it enables
precise syntactic description. It does not, however, reveal important properties (e.g.
ambiguity) which the language designer needs to consider. Further, the implementer must
work informally, since the grammar itself does not suggest efficient parsing techniques (see
the survey of various approaches in ALGOL 60 Implementation [Randell 1964]). Finally,
evidence indicates that the user may also be using informal syntactic models (see the
description of expression evaluation in Introduction to ALGOL [Baumann 1964]). This
situation precludes any serious attempt at formal verification.

A considerable amount of rigor has been obtained via the formal approach in the 
second scenario, but Aho and Ullman acknowledge several shortcomings. Many grammars 
cannot be made LL(1) and, even when they can, the resulting grammars are usually large
and awkward and thus unnatural for syntax directed translation rules. Formal techniques
do not exist for obtaining SLR or LALR grammars. Finally, in both cases nontrivial
changes to the original grammar usually require that the entire process be repeated. 

A fundamental weakness with these approaches is that BNF is inappropriate as a 
definitional metalanguage; it is essentially based on theories of generative grammars. The 
practical demands of parsing and translating restrict us to certain "tractable" grammars, but 
such grammars are often very difficult to recognize. In addition these "tractable" grammars 
tend not to include the most convenient description of a language, so one usually ends up 
with several representations for the same language definition; e.g., a simple grammar for the
user, and a complicated one for the parser. Finally, it is often necessary to transform the 
grammar into a parse table and then into an optimized parse table. Such multiple 
representations form a severe obstacle to formal verification. 

What we would like, then, is a system which includes: 

1) A natural and convenient definitional meta-language for the designer,

2) A user oriented meta-language which makes any defined language easy to learn
and use,

3) A simple method for automatically constructing an efficient parser/translator for
any defined language, and

4) Enough precision in the above to permit formal proof that all components agree
precisely.

Pratt presents a system in "Top Down Operator Precedence" [Pratt 1973] which 
addresses the first three of these issues quite well. He allows the implementer to "write 
arbitrary programs" while offering "in place of the rigid structure of a BNF-oriented 
meta-language a modicum of supporting software, and a set of guidelines on how to write 
modular, efficient, compact and comprehensible translators while preserving the impression 
that one is really writing a grammar rather than a program." This approach has been 
followed in the construction of CGOL, a combination definitional meta-language and 
extensible programming language [Pratt 1974] which demonstrates the power and 
convenience inherent in this approach. 

The CGOL system, as presented, does not satisfy the fourth criterion; it lacks a 
complete formal context in which correctness may be stated and proven. In this paper we 
complete a formal context, present an example implementation, and rigorously prove its 
correctness. 

We believe that many of the difficulties mentioned above may be avoided by writing 
grammars in a meta-language whose descriptive power is tailored to fit the intended 
application. We present and analyze such a meta-language for CGOL type translation; the 
meta-language expresses a class of languages which are easily and naturally parsed. For an 
exact definition of the describable languages, we present a user-oriented model which 
describes how sentences may be generated from any grammar. 

Since the meta-language is designed to fit the parsing method, it is possible to 
construct an extremely simple parsing program which operates by simply reading a given 
grammar as data. We give a LISP implementation of this parser, designed primarily for 
clarity and ease of proof. 

The correctness proof for the example parser is presented in two parts: theoretical
properties and a program proof. The theorems of the first part deal exclusively with
properties of the meta-language; these proofs are completely independent of the program
and the parsing algorithm. The use of these properties allows the actual program proof to
deal almost exclusively with argument passing and flow of control; the program proof is
tedious but straightforward.

Chapter II contains an introduction and analysis of the CGOL approach to parsing. 
Chapter III is an introduction and informal discussion of our system: the syntactic 
meta-language, the generative model of defined languages, the parsing program, and
correctness criteria. Chapter IV covers the same material with complete formal definitions,
and Chapter V contains the correctness proof.



II. THE CGOL APPROACH 

We begin with a presentation and analysis of the parsing/translating method 
proposed by Pratt; a motivation and detailed introduction may be found in "Top Down 
Operator Precedence" [Pratt 1973]. The discussion in this chapter centers on the parsing 
technique: how it works, what features yield unique advantages, and how it compares with 
known work in formal parsing theory. 



II.A The Algorithm 

Pratt's approach to translation (which we refer to as the CGOL approach, after its 
application in [Pratt 1974]), is specifically oriented toward the translation of expressions, 
where an expression is simply an operator (e.g., + or *) with its arguments. For those not 
familiar with expression oriented programming languages, the analogy to arithmetic 
expressions is sufficient for the moment. Each operator of the defined language has 
associated with it a program which embodies most syntactic and semantic information for 
that operator. The programs, called denotations, are executed in a left to right scan by a 
simple, recursive algorithm; each denotation has the power to look at the next symbol in the 
input string, advance (but not back up) the current symbol pointer, and call the parsing 
algorithm recursively to scan another expression. The pointer to the input string is a global 
variable and may be advanced by any denotation. The denotation of a symbol may be 
called at two points in the algorithm: step 2 and step 4. Step 2 corresponds to the case 
where the operator is at the beginning of a string and does not take a left argument. Step 4 
assumes that the expression parsed so far is the left argument to the operator. 

PARSE is the function which is called to scan and translate an expression starting at the 
beginning of an input string. 

STEP 1: PARSE looks at the first symbol of its input string (it will never look farther ahead 
than the current pointer to the string). Since this symbol occurs at the beginning of an 
expression, it is assumed to be an operator which takes no argument on its left side 
(constants and variables are treated as operators with no arguments). PARSE executes
the denotation associated with this symbol. 

STEP 2: The denotation for the current operator moves the pointer rightward along the 
input string, when necessary to gather right arguments. The denotation returns the 
translation of this expression, leaving the input pointer at the symbol following the 
expression. 

STEP 3: PARSE now has the translation of an expression starting at the beginning of the 
input string. The question is asked: should this expression be given as a left 
argument to the next operator in the string, or should it be returned (presumably as a 
right argument to the caller of PARSE)? The decision is made by comparing numerical 
binding powers associated with each operator; the next symbol must have a left 
binding power associated with it, and PARSE was given, as an argument, the right 
binding power of its caller. 

STEP 4a: If the right binding power of the caller is greater (or equal), the translation 
obtained so far is returned. RETURN. 

STEP 4b: If the left binding power of the next symbol is greater, then it is assumed to be
an operator, and the expression translated so far is its left argument. The translation
is passed as an argument to the execution of the denotation associated with this
symbol.

STEP 5: The denotation for this operator moves the pointer rightward along the input 
string, when necessary to gather right arguments. The denotation returns the 
translation of this expression, leaving the input pointer at the symbol following the 
expression. 

STEP 6: Iterate to step 3. 
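
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Pratt's implementation or the LISP parser of this paper: the binding power values, the use of "^" in place of ↑, and the names Parser, peek, and advance are invented for the example, and the denotations are hard-coded (atoms, parentheses, and binary infix operators) rather than stored per operator.

```python
# Hypothetical binding powers, chosen only for illustration.
LEFT_BP = {"+": 10, "*": 20, "^": 30}    # left binding powers
RIGHT_BP = {"+": 10, "*": 20, "^": 29}   # right binding powers ("^" associates right)
END = "<end>"


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + [END]
        self.pos = 0  # shared input pointer; it only ever moves forward

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse(self, caller_rbp=0):
        # Steps 1-2: the first symbol takes no left argument.
        tok = self.advance()
        if tok == "(":
            left = self.parse(0)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
        else:
            left = tok  # a constant or variable translates to itself
        # Steps 3-6: while the next symbol binds more tightly than the caller,
        # hand it the expression parsed so far as its left argument.
        while LEFT_BP.get(self.peek(), 0) > caller_rbp:
            op = self.advance()
            right = self.parse(RIGHT_BP[op])  # collect the right argument
            left = (op, left, right)
        return left  # Step 4a: the caller's right binding power wins
```

For example, `Parser("a + b * c".split()).parse()` yields a tree in which `*` is nested under `+`, and the asymmetric binding powers of "^" make it associate to the right.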

We observe that the definitional information for each operator falls into four general 
categories. In the first category we include the specification of the operator's left and right 
binding powers; these integers are used to locate the right ends of expressions. The second
category simply indicates the presence or absence of a left argument. This feature belongs
in a separate category since the collection of a left argument is not directly controlled by a
denotation; i.e., when a denotation is executed, its left argument, if any, has already been
scanned and translated. Denotations for operators with a left argument are executed from
Step 4b, those without from Step 2. The third category includes a procedure for right
argument collection which may invoke a number of techniques, the most obvious of which 
is the collection of an expression (argument) by recursively calling the parser. In addition, 
parsing decisions may be made by looking one symbol ahead in the input string. The 
fourth category includes a procedure for translation. 



II.B Comparisons with other Methods

From a theoretical standpoint a CGOL translator has unlimited syntactic power. 
This is not, however, the primary issue; it is much more important to ask what it can do 
well. We provide one answer to this question by comparing the algorithm to a number of 
known parsing methods, showing how CGOL combines certain advantages of each. This 
discussion presupposes some familiarity with formal parsing theory. The topics discussed 
are: 

1. Introduction and Example Grammars 

2. The Parse Type 

3. Skeletal Grammars, Ambiguity 

4. Operator Languages, Precedence Parsing 

5. Flow of Control 

6. Combination Unary/Binary Operators 



1. Introduction and Example Grammars 

The key to the effectiveness of the CGOL parser is the simple but powerful control 
structure. The syntactic power of the parser is theoretically unlimited, since arbitrary 
programs may be written as denotations; the control structure, however, creates an 
environment in which a great many grammatical constructs may be handled very simply.
The language CGOL presented in [Pratt 1974] and the translator constructor in this paper
are examples. This flexibility and convenience result from a unique combination of parsing
techniques, most of them well known by themselves. Rather than asking to which
theoretical class CGOL belongs, we look for similarities between the operation of the
CGOL parser and those in known categories. CGOL combines advantages from many
different approaches. 

We will refer to the following grammars in this discussion. They illustrate in a
simple way several of the issues relevant to parsing schemes. Example A is an ambiguous
grammar for the language of arithmetic expressions; A' is a standard unambiguous version
in which + and * associate to the left and ↑ associates to the right. These properties result
from the use of single productions and left and right recursion. B is an ambiguous
grammar for IF-THEN-ELSE statements (the well-known dangling ELSE problem).
Grammar B' is an unambiguous grammar for the same language, representing the usual
solution to the problem.

Grammar A

1 E → E + E
2 E → E * E
3 E → E ↑ E
4 E → ( E )
5 E → a

Grammar A'

1 E → E + T
2 E → T
3 T → T * F
4 T → F
5 F → P ↑ F
6 F → P
7 P → ( E )
8 P → a


2. The Parse Type

If we trace the operation of the CGOL parser, observing the order in which the
components of the parse tree are recognized and assembled, we see that it is essentially 
producing a left corner (LC) parse. We begin the discussion of this observation with a 
brief look at top-down parsing. Parse types are categorized top-down, bottom-up, etc. 
according to the order in which they recognize the grammar rules used to derive the input 
sentence. An equivalent model is to imagine the derivation as a tree with the root 
nonterminal symbol at the top, and the leaves corresponding to the sentence. A top-down 
parser recreates this tree from the top downward, root nonterminal first. Stearns points out 
that this type of parser is especially useful for combined parser/translators; since each 
production is identified before its descendants in the tree, an implementation may
conveniently use recursive descent. Translation rules may correspond to grammar rules, 
which may correspond to nested environments in the translating program. These ideas are 
discussed at length in [Knuth 1968] and [Lewis & Stearns 1968]. 

LL Languages

The LL(k) grammars are those which can be naturally parsed deterministically (i.e., 
without backtracking as the input is scanned) from left to right, top-down. The usual parser
associated with LL grammars is the predictive parser which looks ahead k symbols on the 
input stream before deciding which production to recognize at any given point in the parse. 
In addition to the general usefulness of top-down parsing, predictive parsers for LL(k) 
grammars are very simple; they may be implemented on a one-state Deterministic Push 
Down Automaton (DPDA) [Kurki-Suonio 1969]. Further, they are very efficient and handle 
errors reasonably well [Aho & Ullman 1972].

The central problem with LL parsing is that very few grammars are LL(k). In fact,
very few languages have LL(k) grammars for any k; an example is grammar B', which
generates a non-LL language. When languages do have LL(k) grammars, these are not
always the smallest or most natural descriptions of the language. For example, Stearns
discusses transformations which may convert grammars into LL(1) grammars at the expense
of added complexity [Stearns 1971]. Grammar A' for arithmetic expressions is not an LL
grammar for any k because of left recursion (in rules like E → E + T). Left recursion may
be eliminated by converting a grammar to Greibach Normal Form (via a known algorithm). 
The GNF grammar for arithmetic expressions is essentially right associative, although the 
old grammar parse may always be recovered from a new grammar parse. Stearns presents 
optimizations which reduce the nonterminal explosion in the case of arithmetic expressions 
(in general the transformation squares the number of nonterminal symbols), but the result 
depends heavily on the fact that this is an operator precedence language. This property of 
arithmetic expression grammars (such as A') allows a simpler treatment by the direct use of 
operator precedence (to be discussed below). 
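
To make the cost of removing left recursion concrete, the following Python sketch applies the simple textbook transformation for immediate left recursion to Grammar A'. It is not the Greibach Normal Form conversion mentioned above, nor Stearns's optimized transformation; the dictionary encoding of the grammar, the "_tail" naming of new nonterminals, and the use of "^" for ↑ are assumptions made for illustration.

```python
# Grammar A' as a dict: nonterminal -> list of right sides (tuples of symbols).
A_PRIME = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("P", "^", "F"), ("P",)],
    "P": [("(", "E", ")"), ("a",)],
}


def eliminate_immediate_left_recursion(grammar):
    """Rewrite each rule N -> N x | y as N -> y N_tail, N_tail -> x N_tail | empty."""
    result = {}
    for nt, alternatives in grammar.items():
        # Right sides that begin with the nonterminal itself, minus that symbol.
        recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
        others = [alt for alt in alternatives if not alt or alt[0] != nt]
        if not recursive:
            result[nt] = list(alternatives)
            continue
        tail = nt + "_tail"  # hypothetical name for the new nonterminal
        result[nt] = [alt + (tail,) for alt in others]
        # The empty tuple stands for the empty right side.
        result[tail] = [alt + (tail,) for alt in recursive] + [()]
    return result


LL_FRIENDLY = eliminate_immediate_left_recursion(A_PRIME)
```

Even in this small case the transformation adds two nonterminals (E_tail and T_tail), and the rules for F would still need left factoring before a one-symbol predictive parser could choose between them.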

Left Corner Parsing 

As mentioned above, LC parsing is a variant on top-down parsing. While a top-down
parser must recognize the occurrence of a rule before any of its descendants, an LC
parser does not recognize a rule until its leftmost descendant has been found. This leftmost
descendant, the leftmost symbol in the right part of the rule, is called the left corner. This
corresponds quite closely to the operation of the CGOL parser; each rule in CGOL
corresponds to an operator, and each operator is recognized (its denotation executed) as it is
encountered in a left to right scan. Since operators may have expressions occurring as left
arguments, they are recognized after their left corner. This parse method has been said to
parse the left corner of a rule bottom-up and the rest of the rule top-down. When the first
symbol of a rule is a terminal symbol, as with all NILFIX and PREFIX operators in CGOL,
the parser is operating essentially top down.

Nondeterministic LC parsing has been used for some time [Irons 1961] [Cheatham 
1967], but only more recent work has examined deterministic LC parsing. Rosenkrantz and 
Lewis identify the LC(k) languages, those which have LC(k) grammars and can be parsed 
deterministically LC with k symbol lookahead [Rosenkrantz & Lewis 1970]. The class of
LC(k) languages is shown to be identical to the class of LL(k) languages via the result that 
the elimination of left recursion produces an LL(k) grammar if and only if the original 
grammar was LC(k). Thus LC(k) grammars give us no ultimate increase in expressive 
power, but they do offer a naturalness and economy of description in many cases. In an 
LC(k) translator this advantage is gained at the cost of some potential flexibility (since left 
corner nonterminals may not be parsed top-down). An important advantage is that one rule 
corresponds to one operator, and the semantics for a rule may be conveniently localized. 

Grammar A' is LC(1), and thus a transformed version, without left recursion, is
LL(1); in fact, this is nearly identical to the example transformed by Stearns in [Stearns 1971]
where the number of nonterminals becomes squared under the transformation. Grammar
B', however, is not LL(k) for any k. In fact, it is intuitively clear that the language
generated by B' is not an LL language, since it is impossible to tell at the beginning of a
string which of two rules is to be applied; there can be no LL(k) or LC(k) grammar which
generates the language.



3. Skeletal Grammars 

While the CGOL parser traces a left corner parse and operates with lookahead 1, it is 
not actually an LC parser as defined by Rosenkrantz and Lewis, since it uses no grammar 
in the ordinary sense. There is only one nonterminal in the parser, the implicit one for an 
expression. All expressions are treated the same. What we have then is more like the
grammar A, sometimes called a skeletal grammar. Skeletal grammars typically are
ambiguous, so external means need to be used to resolve any ambiguous sentences. The
CGOL parser resolves this ambiguity by a number of techniques sometimes seen in parser
implementations: linear operator precedence functions, flow of control decisions, and
two-state unary vs. binary operator recognition. Some of these techniques have been viewed 
as optimizations to be used whenever a grammar is found with the right property, although 
it is seldom obvious at a glance if this is the case. Techniques have even been developed to 
transform grammars in the hope that the desirable properties might be obtained. 

The CGOL approach is to avoid juggling context-free grammars at all. This is done 
by not attempting to describe difficult matters with cfg rules. These rules are certainly 
useful for describing phrase structure (as in the two ambiguous example grammars), but
begin to grow in size and lose clarity when they describe operator hierarchies and
association (as in grammar A').



4. Operator Languages, Precedence Parsing 

Some of the information which is normally represented by nonterminal symbols may 
be defined as properties of the terminal symbols, if the languages are defined by operator 
grammars. These are context-free grammars which have no adjacent nonterminal symbols. 
Although these are something of a special case in the literature on formal languages, a great 
many useful programming languages have (or are very close to having) operator grammars. 
All four example grammars are operator grammars; see also [Floyd 1963] for an operator 
grammar for ALGOL. In fact, it seems that adjacent nonterminals usually appear when we 
try to solve some "problem" with a grammar (say ambiguity, or left recursion) by 
transforming it into something less natural. Rules with no terminal symbols at all are
especially nonintuitive; we like to think of each rule as having some meaning, but when a 
rule has no associated terminal symbols, its occurrence relative to a sentence will only be 
implicit. In the CGOL parser each rule is attached to some symbol, an operator. With this 
restriction CGOL is able to apply the following techniques. 

Precedence Parsing 

The term precedence parsing describes a well known family of techniques used in 
bottom-up parsing. The standard implementation of a bottom-up parse is known as a
shift-reduce algorithm. This algorithm scans the input, one symbol at a time, from left to
right. A shift step reads an input symbol and pushes it onto a stack. A reduce step occurs
when a sequence of symbols on the top of the stack corresponds to the right side of a
grammar production; this leftmost reducible phrase is called the "handle" of a sentential 
form. This series of symbols is popped off the stack and is replaced by the nonterminal 
symbol on the left of the rule. A parse is complete when the stack contains only the root 
nonterminal of the language and the input stream is empty; the output is a bottom-up parse. 

Precedence parsing methods are distinguished by the method of making the 
shift-reduce decision, i.e. deciding if the scan has reached the right end of a handle. The 
general technique is to derive from the grammar a relation (usually written ⋗) on the
symbols of the language. Although a variety of precedence techniques have been
developed, their essential feature is that they compare two adjacent symbols in a sentential
form; if the relationship ⋗ holds between them, the right end of a handle has been reached.

Operator Precedence 

The application of precedence techniques to operator languages leads to a well known 
and efficient parsing method (see [Floyd 1963]). Operator precedence grammars are those 
for which the shift-reduce decision may be made uniquely by considering only terminal 
symbols; i.e., the uppermost terminal symbol on the stack is compared with the next input 
symbol. Considerable storage space and algorithmic complexity is saved by simply ignoring 
nonterminal symbols; i.e., not using them to carry information. The resulting parse tree is 
called the skeletal parse, since all productions with single nonterminals on the right side are
missing. The interesting structure is there, though, since extra nonterminals with rules like
E → T in Grammar A' are often included only to express properties like right or left
association and have no semantic implications. 
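
A minimal Python sketch of an operator precedence shift-reduce loop for the operators of Grammar A' (parentheses omitted for brevity) might look as follows. The set REDUCE, which lists the pairs (uppermost stack terminal, next input symbol) for which the reduce relation holds, and all names in the code are assumptions made for illustration; "^" stands in for ↑. Only terminal symbols are compared, and the result is the skeletal parse.

```python
OPERATORS = {"+", "*", "^"}

# Pairs (top, next) for which the right end of a handle has been reached.
REDUCE = {("+", "+"), ("*", "+"), ("^", "+"), ("*", "*"), ("^", "*")}

END = "$"  # hypothetical end marker, used for both stack bottom and end of input


def operator_precedence_parse(tokens):
    stream = list(tokens) + [END]
    op_stack = [END]   # terminal symbols only
    operands = []      # translated subexpressions; nonterminals carry no information
    i = 0
    while True:
        tok = stream[i]
        if tok not in OPERATORS and tok != END:
            operands.append(tok)  # shift an operand
            i += 1
            continue
        top = op_stack[-1]
        if top == END and tok == END:
            return operands[0]    # input consumed and stack empty: parse complete
        if top != END and (tok == END or (top, tok) in REDUCE):
            # Reduce: the handle is "left op right" on top of the stacks.
            op = op_stack.pop()
            right = operands.pop()
            left = operands.pop()
            operands.append((op, left, right))
        else:
            op_stack.append(tok)  # shift the operator
            i += 1
```

Note that no production such as E → T is ever reduced; the associativity and hierarchy that those rules encode in A' live entirely in the REDUCE relation.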

Although operator precedence seems a somewhat obscure property for a grammar to 
have, Floyd argues that many useful programming languages are quite close to having 
operator precedence grammars. He offers an ALGOL operator precedence grammar as an 
example and identifies certain problems which he suggests be solved via escape clauses, or 
special parse techniques. It seems that the technique handles the majority of language 
features quite well, but has certain difficulties which would be much better dealt with by 
exception, than forced into the basic scheme. CGOL deals with some of these problems quite
well. 

Pratt conjectures that operator precedence techniques are widely applicable because of 
their intuitive appeal; they correspond exactly to the ordinary conventions for writing 
arithmetic operators. Grammar A' for example is an operator grammar in which the
relations ↑ ⋗ * ⋗ + hold. These represent the notion of the precedence hierarchy of these
operators. We also note that + ⋗ + and * ⋗ *, meaning that these two operators associate
to the left. On the other hand, the relation ↑ ⋖ ↑ holds; this means, in the operator
precedence scheme, that this operator associates to the right.

Linear Precedence Functions 

An optimization often considered for operator precedence schemes (and for 
precedence relations in general) is the encoding of the precedence matrix (i.e. the relation) 
via linear functions. Typically, two integer valued functions f and g are defined over
terminal symbols. If for two terminal symbols x and y the relation x ⋗ y holds, then it will
also be true that f(x) > g(y). While the technique only works for a small number of
possible matrices, it turns out to be easily applicable to grammars like A'. Again, the 
conventional hierarchy of the operators in arithmetic expressions allows this encoding 
scheme to work. 
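
As a small check of this encoding, the Python sketch below tabulates the reduce relation among the three operators of Grammar A' and verifies that a pair of integer functions reproduces it. The particular values of F and G are one workable assignment chosen for illustration, not values from the paper, and "^" stands in for ↑.

```python
from itertools import product

OPS = ["+", "*", "^"]

# Pairs (x, y) for which x takes precedence over a following y (reduce);
# for every other pair of these operators the parser shifts.
GREATER = {("+", "+"), ("*", "+"), ("^", "+"), ("*", "*"), ("^", "*")}

# Hypothetical linear precedence functions.
F = {"+": 2, "*": 4, "^": 5}
G = {"+": 1, "*": 3, "^": 6}


def encodes_relation(f, g, greater, ops):
    """True if f(x) > g(y) holds exactly for the pairs in `greater`."""
    return all((f[x] > g[y]) == ((x, y) in greater) for x, y in product(ops, ops))
```

In CGOL terms, f plays a role similar to an operator's right binding power and g to its left binding power: the pair (5, 6) for "^" makes it right associative, while (2, 1) and (4, 3) make "+" and "*" left associative.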

An operator precedence parser for arithmetic expressions is very compact and 
efficient. CGOL makes use of the operator precedence technique, but without forcing the 
designer to express his ideas in BNF first, only to have them transformed by algorithm into 
what might essentially be the original idea. The designer simply defines left and right 
binding powers for each operator. 

We recall that left corner parsing treats the left argument to an operator in bottom-up 
mode, and the rest of the rule in top-down mode. It is in the bottom-up mode that this 
technique is used by CGOL. When PARSE has scanned a complete expression, a decision is 
made by binding powers. If the next token of the string wins the expression, then the 
expression becomes a left argument. If PARSE returns the expression, then the expression is 
the result of a top-down call from some higher level. The operation of CGOL for 
grammars composed of only arithmetic operators, like A', is exactly parallel to the operation
of the canonical strong LC machine of Rosenkrantz and Lewis [Rosenkrantz & Lewis 1970]. 
The nested environments of CGOL correspond to the stack of the LC machine. An LC 
stack entry may either be a single nonterminal symbol, corresponding to a call to PARSE 
which has not yet parsed an expression, or a pair of nonterminal symbols, corresponding to 
a call to ASSOC which already has a left argument (or left corner) parsed, waiting to be 
attached to something. 



5. Flow of Control

A major difference between CGOL and the LC machine becomes clear when we 
consider grammar B'. This is an operator precedence grammar which is easily handled by
traditional bottom-up methods, but it is not LL(k) for any k. By the result of Rosenkrantz 
and Lewis then it is also not LC(k) for any k. The CGOL parser handles this example 
with great ease, since the program for the operator IF can simply parse its THEN argument 
and then look one token ahead to see if it is ELSE. Both possibilities are treated by the 
same denotation, so we are using the equivalent of the ambiguous version, grammar B. As 
with arithmetic expressions, CGOL uses an ambiguous grammar with a simple rule to 
resolve ambiguity; in this case it is simply to take the ELSE if it is there. Aho, Johnson, and 
Ullman treat this example in some detail, pointing out that this solution is a simple fix to 
the otherwise ambiguous top-down parsing table for grammar B
[Aho, Johnson, & Ullman 1973]. We have a situation where the top-down predictive parsing
technique works for cases which are outside of the normally defined LL boundaries. By 
allowing arbitrary programs as denotations, CGOL allows an operator to collect any right 
arguments in a very general top-down fashion. We might say that each operator has its 
own top-down predictive parser for the grammar of its right arguments. It is this feature 
which allows the use of regular expressions to specify annotation patterns within the 
meta-language defined in this paper. In fact the restrictions placed on the use of the 
regular operators make each annotation pattern the equivalent of a miniature LL(1) 
language, although the restrictions are in fact even stronger than LL(1).
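
The behaviour of the IF denotation described above can be sketched in Python as follows. This is an illustrative assumption, not code from CGOL: conditions and statements are treated as single atomic tokens, the token spellings "IF", "THEN", and "ELSE" are taken from the text, and the function name and tuple representation of the result are hypothetical.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    tok = tokens[pos]
    if tok != "IF":
        return tok, pos + 1  # an atomic statement or condition
    cond, pos = parse_statement(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos] != "THEN":
        raise SyntaxError("expected THEN")
    then_part, pos = parse_statement(tokens, pos + 1)
    # One-token lookahead resolves the ambiguity: take the ELSE if it is there.
    if pos < len(tokens) and tokens[pos] == "ELSE":
        else_part, pos = parse_statement(tokens, pos + 1)
        return ("IF", cond, then_part, else_part), pos
    return ("IF", cond, then_part), pos
```

Because the inner IF sees the ELSE first, a nested conditional such as `IF c1 THEN IF c2 THEN s1 ELSE s2` attaches the ELSE to the most recent THEN.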



6. Combination Unary/Binary Operators 

The third technique used to resolve ambiguity in a CGOL parser is a solution to a 
problem encountered by Floyd when he tried to write an operator precedence grammar for 
ALGOL. Certain symbols of the language have two uses, and operator precedence by itself 
can not distinguish between them. The common example of this is the minus operator 
which may be used either as unary or binary. CGOL allows this double definition in a 
general form. Any operator may have two unrelated definitions if one of them has a left 
argument and one does not. CGOL is in this sense a two state machine, one state
corresponding to an immediate call to PARSE, when no left argument is present, and the 
other to a call to ASSOC, when there is a left argument available. There is never any 
ambiguity. 
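
The two-state idea can be illustrated with a short Python sketch in which "-" has two unrelated definitions, selected purely by whether a left argument is available. The table names PREFIX and INFIX, the binding power values, and the "neg" tag are assumptions made for the example.

```python
# Definition used when no left argument is present: right binding power only.
PREFIX = {"-": 25}
# Definitions used when a left argument is available: (left bp, right bp).
INFIX = {"-": (10, 10), "*": (20, 20)}


def parse(tokens, pos=0, caller_rbp=0):
    """Return (tree, next_pos) for the expression starting at tokens[pos]."""
    tok = tokens[pos]
    pos += 1
    # State 1: start of an expression, so "-" can only be the unary operator.
    if tok in PREFIX:
        arg, pos = parse(tokens, pos, PREFIX[tok])
        left = ("neg", arg)
    else:
        left = tok
    # State 2: a left argument exists, so "-" can only be the binary operator.
    while pos < len(tokens) and INFIX.get(tokens[pos], (0, 0))[0] > caller_rbp:
        op = tokens[pos]
        right, pos = parse(tokens, pos + 1, INFIX[op][1])
        left = (op, left, right)
    return left, pos
```

In the input `- a - - b * c` the first and third minus signs are parsed as unary and the second as binary, with no lookahead beyond the current token.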



III. BASIC CONCEPTS

In this chapter we motivate and informally introduce the components of our language 
system. The notions presented will be given full formal treatment in the following chapters. 
We discuss first the meta-language, giving examples of its use. Since the meta-language is 
nonstandard, we will present a generative model which determines the sentences of a 
defined language. The chapter concludes with a brief discussion of the translator algorithm 
and its correctness criteria. 



III.A The Meta-Language

Our formal language system is based on a syntactic meta-language which: 

(a) restricts the syntactic power of the system in a way which permits rigorous proofs, 

(b) embodies the full power of the scheme in the sense that we want it to express 
anything which the parse/translation scheme handles naturally and efficiently, and 

(c) allows the automatic construction of simple translators. 

We recall from Chapter II that the translator uses four types of information for each 
operator in the defined language: 

(1) Left and right binding powers, 

(2) Presence of left argument, 

(3) Pattern of right and annotated arguments, and 

(4) Translation rule 






In the original CGOL facility this information is specified by the designer in a varying 
mixture of declarative and procedural modes. To facilitate uniform treatment, we will allow 
exactly one type of meta-language statement, a production, which will contain all of the data 
necessary to define a single operator. We restrict the syntactic power of the translator by 
requiring that all syntactic information (parts 1-3 listed above) be stated in a declarative 
language, leaving only the translation rule in procedural form. This declarative segment 
includes a template of argument positions (parts 2 and 3), and a specification of binding 
powers (part 1). Thus we might write: 

Ex.1 ~ "+" ~ ,14,14;<denotation> 

to define + as an operator of the language with left and right arguments. It has left and 
right binding powers of 14, and <denotation> is a procedure which accepts as input the 
translations of the arguments and calculates the translation of the entire phrase. To deal 
with more general programming language features we allow productions like 

Ex.2 "IF" ~ "THEN" ~ ("ELSE" ~ | λ) ,6;<denotation> 

which defines the standard conditional operator. This production includes the specification 
of extra right arguments (in addition to the normal one with IF acting as a prefix 
operator); we call "THEN" ~ ("ELSE" ~ | λ) the annotation pattern of the operator 
"IF". Here the alternation (or union) symbol | is used to specify a choice of two patterns, 
one of which is the null string λ. An even more powerful conditional may be specified by 
the production: 

Ex.3 "IF" ~ "THEN" ~ ("ELSEIF" ~ "THEN" ~)* ("ELSE" ~ | λ) ,6;<den> 

which uses the star closure symbol * to indicate any number of occurrences. 

We write annotation patterns using regular expression notation (as in Examples 2 
and 3) because it is well known, quite general, and amenable to formal treatment. In an 
actual implementation one might extend this notation to include pattern operations 
expressible in terms of the basic notation. For example, we might introduce the brackets [ 
and ], and let [a] denote (a|λ). We could then write simply: 

Ex.2' "IF" ~ "THEN" ~ ["ELSE" ~] ,6;<denotation> 

instead of Example 2. Another possibility might be < and > to mean + closure (one or 
more occurrences). Such extensions are not included here, since they do not affect the 
theoretical behavior of the meta-language. 
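
The extensions just mentioned can be reduced mechanically to the basic notation. The 
following Python sketch is an illustration only: the nested-tuple encoding of patterns and 
the helper names tok, cat, opt, and plus are assumptions of this example, not part of the 
meta-language, and the reading of + closure as q followed by (q)* is the usual regular 
expression identity rather than something stated in the text. 

```python
# Patterns as nested tuples (an encoding assumed for this sketch):
#   ("lam",)            the null string λ
#   ("tok", d, has_arg) "d" ~ when has_arg is True, "d" alone otherwise
#   ("cat", q, r)       concatenation qr
#   ("alt", q, r)       alternation (q|r)
#   ("star", q)         star closure (q)*
LAM = ("lam",)

def tok(d, has_arg=True):
    return ("tok", d, has_arg)

def cat(first_part, *rest):
    """Right-nested concatenation of one or more patterns."""
    if not rest:
        return first_part
    return ("cat", first_part, cat(*rest))

def opt(q):
    """[q], read as (q|λ)."""
    return ("alt", q, LAM)

def plus(q):
    """<q>, read as q(q)*: one or more occurrences."""
    return ("cat", q, ("star", q))

# Example 2' written with the bracket extension, and Example 2 written out.
EX2_PRIME = cat(tok("THEN"), opt(tok("ELSE")))
EX2 = ("cat", tok("THEN"), ("alt", tok("ELSE"), LAM))
```

Under this encoding EX2_PRIME and EX2 are the same value, which is the sense in which the 
bracket notation adds convenience but no theoretical power. 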

This meta-language is restricted enough to allow formal treatment (goal a above) and 
is general enough to exploit the power of the parsing scheme (goal b). The patterns, 
however, are too powerful for simple parsing (goal c); any of these patterns could 
theoretically be parsed, but not all of them easily or unambiguously. We solve this by 
restricting the class of permissible patterns to those within the power of a very simple 
parsing algorithm. 

This matching algorithm for patterns (arguments on the right side of operators) is 
deterministic and never looks more than one symbol ahead in the input string. Our model 
of the algorithm is a person with one finger on the pattern, one finger on the input string, 
and almost no memory! It should always be clear what to do next; no backing up allowed. 
To put this differently, the user should always be able to understand the parsing method. 
To insure the correct operation of the parser, we adopt the following three rules. 

The first rule is that patterns joined by alternation not begin with the same symbol. 
Thus we disallow the pattern: 

Ex.1' "IF" ~ ("THEN" ~ | "THEN" ~ "ELSE" ~) ,6;<denotation> 

as an alternative to Example 1. In fact we prefer the original form for the following reason: 
an annotated argument should be identified by the name of the preceding symbol, not by its 
position in a pattern. We intend that there be no difference between the two THEN 
arguments specified in example 1'. 

The second rule solves a problem arising from the use of the symbol λ in patterns. 
Whenever the pattern λ is an alternative, the scanner could "match" λ and miss a non-null 
matching symbol. This problem is solved by a fiat similar to the dangling ELSE solution. 
The parser will always match as much of the input string as possible; the pattern λ is always 
the lowest priority choice. 

The third rule prohibits certain other patterns which cannot be completely handled 
by the parser. For example, we consider the production 

Ex.4 "FOO" ~ ("BAR" | λ) "BAR" ,2;<denotation> 

which describes two possible phrases; one has one occurrence of BAR, the other has two. 
Because of the fiat above, our algorithm can only parse the second possibility correctly. 
This is a local case of the dangling ELSE problem, and since it is detectable, we disallow 
patterns in which it occurs. Informally, this rule restricts the use of patterns which give the 
parser a choice whether to continue, based on the presence of some delimiter symbol like 
ELSE. We will require that such a pattern not be concatenated on the left with a pattern 
which can start with one of its delimiter symbols. In place of Example 4 we might use the 
production 

Ex.4' "FOO" ~ "BAR" ("BAR" | λ) ,2;<denotation> 

which matches the same phrases but can be parsed correctly. 

While not immediately obvious, these restrictions are completely local to each 
production and are intuitively motivated. Patterns which violate them can sometimes 
be rewritten in an acceptable form, and the acceptable form often makes more sense. In 
fact, the verification of these conditions is computationally quite simple and an interactive 
definitional facility would have verification and debugging aids built-in. These rules are 
considerably simpler and more intuitively appealing than the LL and LR conditions. 

On a global level, use of the metalanguage is quite straightforward; the global 
restrictions which do exist are very simple. Only one production may be given per operator, 
although some symbols may be used for two different operators, one with a left argument 
and one without (e.g., the binary and unary minus operators would be defined in two 
separate productions). A symbol defined as an operator may also be used as a delimiter (in 
annotation patterns) as long as its binding powers remain well defined, since the role of a 
delimiter is passive. This sort of detail is trivially manageable by a definitional facility. 

An important property of this metalanguage is that a set of productions forms a 
complete language definition; no other information is necessary. It is precisely this extreme 
modularity which makes designing extensible languages convenient. 

III.B User Model 

Once we have a language definition, a set of productions, we want to offer the user a 
manual explaining how to use the defined language. We claim that the productions 
themselves are straightforward enough (and their syntactic interactions simple enough) to 
serve as the basic manual, once our generative model is understood. For precision and 
verification this model will be presented in formal terms. It should be understood, however, 
that the formalism is intended only to add rigor to intuition; intuition need not be bent in 
order to agree with formalism. Some of the assumptions on which the model is based are 
discussed in Top Down Operator Precedence [Pratt 73]. 

The operator is the basic definitional unit in these languages; appropriately, the 
user's primitive concept is the relation "is an argument of". We carry this one step further 
by specifying what kind of argument (what role it plays). Also, to allow more than one 
argument of the same kind, we specify an ordering. It is then natural to represent 
expressions as trees: nodes correspond to operators and subtrees correspond to arguments. 
The branches are ordered and labelled to identify the argument: normal arithmetic 
arguments are connected by branches labelled left or right, and annotated arguments are 
labelled by the annotating token itself, the delimiter. This is very closely related to 
McCarthy's abstract syntax [McCarthy 1963]. 

The purpose of syntactic convention is to uniquely represent these expression trees as 
linear strings of symbols. Two well known examples are the use of postfix and prefix 
notation to represent ordered trees. In the domain of binary trees infix notation is 
commonly used, but here additional conventions are necessary to resolve the association of 
intervening arguments. An example of this problem is the string a+b*c, where we know by 
convention that b is the left argument of the operator * and not the right argument of +. 
The convention used here is usually viewed as a hierarchy of the arithmetic operators in 
which the higher operators "go first" or "take precedence" over lower operators. We use this 
convention to recover the correct tree from a given string; it may also be used to determine 
which trees are directly expressible as strings, and which trees require the use of 
parentheses. 

The languages we define use a combination of notational conventions including 
infix. To deal with association problems we adopt a convention based on the idea of 
operator hierarchy. A binding power is a numerical value which represents the precedence 
level of an operator; thus an expression between two operators is understood to be an 
argument of the operator with the higher binding power. This convention is generalized 
somewhat by allowing separately specified left and right binding powers for each operator, 
allowing operators to behave asymmetrically. 
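
As an informal illustration of how binding powers settle association, here is a minimal 
Python sketch of a binding-power parser for infix operators over single-token operands. It 
is not the CGOL parser: the function names parse and tree, the token-list input, and the 
binding powers chosen for * and ^ are assumptions made for this example (only the value 14 
for + comes from Example 1). 

```python
# Left and right binding powers. Only "+" (14, 14) is taken from Example 1;
# the entries for "*" and "^" are hypothetical values chosen for illustration.
LBP = {"+": 14, "*": 16, "^": 18}
RBP = {"+": 14, "*": 16, "^": 17}

def parse(tokens, pos=0, rbp=0):
    """Parse tokens[pos:] and return (tree, next_pos).

    The expression built so far becomes the left argument of the next
    operator only if that operator's left binding power exceeds rbp, the
    right binding power of the operator waiting on the left.
    """
    left = tokens[pos]          # operands are single tokens in this sketch
    pos += 1
    while pos < len(tokens) and LBP[tokens[pos]] > rbp:
        op = tokens[pos]
        right, pos = parse(tokens, pos + 1, RBP[op])
        left = (op, left, right)
    return left, pos

def tree(text):
    """Convenience wrapper: split on blanks and return only the tree."""
    return parse(text.split())[0]
```

Under these assumed values, tree("a + b * c") gives ("+", "a", ("*", "b", "c")), and giving 
^ a right binding power below its left binding power makes a ^ b ^ c associate to the right. 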

We incorporate this convention in a model for writing linear expressions from trees. 
The basic rule for writing expressions is: don't use an expression e as a left (right) 
argument to an operator op if the left (right) binding power of op is high enough to cause 
any subexpression of e to associate incorrectly. We know that a*b may be used as an 
argument to +, but a+b may not be used as an argument to *. Formally, we measure the 
resistance (on each side) of an expression to false associations. We will define the r-index 
(l-index) of an expression to be essentially the lowest right (left) binding power of any 
internal operator exposed to the right (left) side of the expression. For example, the 
r-index of a+b*c is equal to the right binding power of +, since an operator to the right 
of this expression (say another *) might take b*c incorrectly as a left argument. The 
l-index of the expression SIN a is ∞, since it is totally invulnerable to false associations 
on the left. Although this model does not allow certain expression trees to be written, most 
defined languages include a bracketing operator (like parentheses) which is semantically null 
and creates an expression with l-index = r-index = ∞. Thus, (a+b) may be used as an 
argument to *. 

The only other way in which operators may syntactically interact results from the 
generalized dangling ELSE problem. The expression IF a THEN b has the property that 
an ELSE occurring immediately after b will cause the parser to continue collecting arguments 
for this phrase (recall the fiat: given the choice of continuing or not, the parser will always 
continue). The informal rule is: don't follow an expression e by a delimiter which will get 
incorrectly included with e (or some subexpression of e). This rule prohibits the use of an 
IF-THEN expression as the second argument (i.e., the THEN argument) to an 
IF-THEN-ELSE expression. We formalize this rule by defining the c-set for each 
expression, the set of tokens which would cause argument collection to continue incorrectly 
at some level. We say: an expression may not be followed by a token in its c-set. 

These three properties (l-index, r-index, and c-set) completely describe the 
syntactic behavior of any expression. A standard BNF grammar would represent the same 
information implicitly by the use of one nonterminal symbol. More closely related 
techniques have been studied which attach various modifiers to nonterminal symbols in 
context-free grammars; see especially "Indexed Grammars" [Aho 1968] and the 
transformation defined on well-chained grammars in [Stearns 1971]. The CGOL approach 
is extreme in the sense that nonterminal symbols play virtually no role at all. 

The separate treatment of syntactic properties is an important feature of this 
approach; both designer and user can deal with the various syntactic issues explicitly and 
separately. The most prominent syntactic feature of a language is its basic phrase structure, 
expressed by the productions as an ambiguous context-free grammar with one nonterminal 
symbol (called "expression"). Argument association is dealt with separately by binding 
powers, similar to the arithmetic conventions. Pratt argues that binding powers may be 
usefully assigned on the basis of an implicit hierarchy of data types, corresponding closely 
to ordinary intuition and conventions for programming languages. The annotation patterns 
are also treated separately. Delimiters like ELSE which can cause problems can be explicitly 
noted (an easily computable property) and the operator combinations which interact can be 
listed. For example, it would be observed that IF-THEN-ELSE expressions interact with 
themselves if improperly nested. In a well designed language, these interactions will be 
rather limited in number, freeing the user from this concern in most cases. 



III.C Automatic Parsing 

Our meta-language defines a class of programming languages for which the CGOL 
translation technique is particularly appropriate. We demonstrate by presenting a simple 
parsing program which, when given a set of productions as data, correctly parses sentences 
of the defined language and can be easily extended to handle translation via denotation 
programs. The program is a working (although inefficient) LISP implementation which 
requires the transformation of productions into a suitable LISP representation. 

A definitional facility would be a set of programs to provide this and other services 
to the designer. The meta-language processor is a program which accepts productions of the 
meta-language, either incrementally or in batches, and stores the information. In this 
implementation the data are simply attached to the name of the operator being defined (via 
the property list). A facility could also include automatic verification of annotation patterns 
with debugging advice, and automatic documentation. 

Incremental implementations would be convenient and could even be performed 
on-line. An extreme example is a bootstrap, in which denotations may be written in the 
language defined so far (e.g. the language CGOL [Pratt 1974]). 

III.D Correctness 

We consider a formal proof of correctness an essential, practical component of the 
system; it is pointless to have automatic parsing without a guarantee that no mistakes will be 
made. The claim we want proven, then, is simple: given any language definition in the 
proper representation, the parser works correctly. 

To say that the parser works correctly requires a precise definition of what it should 
do. Our specifications of a meta-language and user model provide a formal context in 
which correctness may be rigorously defined. 

We say that a parser operating on some language definition is correct when the 
following are true: 

I. If the expression (i.e., tree) e is written according to convention as the string w, 
then the parser will recover the tree e. 

II. If the parser recovers a tree e, then the input string is in the defined language. 

Part I guarantees that any valid string of the language will be parsed correctly; part II 
assures that no incorrect strings will be parsed. 

The correctness theorem is actually a statement regarding the behavior of two functions: 
writing (mapping trees into strings) and parsing (mapping strings into trees). Both parts of 
the theorem are proven by induction, but over different domains: part I over the domain 
of trees, and part II over strings. It is a corollary of the theorem that the languages defined 
are unambiguous; i.e., no string can be written from more than one tree. 

From the standpoint of formal language theory, the theorem is a proof of 
equivalence of two alternate language definition mechanisms. A generative description is 
presented as the user model; an analytic description is implicit in the parsing program. 

The proof itself is carried out in two phases. In the first, we prove a number of 
theoretical properties of the language class, i.e., of the definitional mechanism. These 
properties are independent of any program or parsing algorithm. Given these results, the 
actual proof of the parsing program is tedious, but quite straightforward. 






IV. FORMAL DEFINITIONS 

In this chapter we present the formal details of the language system introduced in 
Chapter III. Section IV.A presents the meta-language; the parsing program for defined 
languages is given in Section IV.C. The generative model of defined languages, given in 
Section IV.B, permits a formal statement of parser correctness, discussed and proven in the 
next chapter. 



IV.A The Meta-language 

We begin by naming the basic lexical units of our defined languages. 

Definition: A token is a single lexical symbol in a defined language. 

Notation: 

(i) Actual tokens will be represented using only upper case letters; e.g. IF, ELSE, 
+, and (. 

(ii) Lower case letters are used for meta-variables in this discussion; e.g. t (possibly 
subscripted) refers to some token. 

(iii) Greek letters represent strings of tokens; e.g. α, β, γ. 

While the token is a lexical unit, the operator is our basic definitional unit. 

Definition: An operator is a set of semantic and syntactic information, representing some 
operation. We use the meta-variable op for operators. 

Productions 

An important feature of this system is that all specifications necessary to define a 
programming language are in the form of operator definitions. A single operator definition 
is expressed in a meta-language statement called a production; productions are the only 
statements in the meta-language. 

Definition: A production is a cluster of information which defines an operator and 
associates it with a token of the defined language. A production defining the 
operator op for the token OP must be in one of four forms depending on the operator 
type. 

OPERATOR TYPE    PRODUCTION 

NILFIX           "OP" <p> ,<rbp>;<denotation> 

PREFIX           "OP" ~ <p> ,<rbp>;<denotation> 

POSTFIX          ~ "OP" <p> ,<lbp>,<rbp>;<denotation> 

INFIX            ~ "OP" ~ <p> ,<lbp>,<rbp>;<denotation> 

where: 

1) quotes (" ) are meta-language symbols enclosing the token being defined. 

2) ~ is a meta-language symbol denoting the presence of an argument. 

3) <p> is an optional annotation pattern, defined in the next paragraph. 

4) <lbp> and <rbp> are left and right binding powers, non-negative 
integers. 

5) <denotation> is a program which calculates the translation of op, given 
the translations of its arguments. 

Notation: When an operator op has been defined we refer to the components of the 
production as follows: 

type[op] is one of {NILFIX, PREFIX, POSTFIX, INFIX}. 

p[op] is the annotation pattern defined for op. 

lbp[op] is the left binding power defined for op, if any. 

rbp[op] is the right binding power defined for op. 

den[op] is the denotation defined for op. 

Aside from patterns, what we have is a simple formalism in which ordinary unary and 
binary arithmetic operators may be defined. The first part of each production is a template 
in which the defined operator is quoted and the symbol ~ is a place holder for arguments. 
The left and right binding powers are stated separately, and the denotation incorporates a 
translation rule. We recall the production in Example 1 of Section III.A in which the 
operator + is defined: 

Ex.1 ~ "+" ~ ,14,14;<denotation> 

In this case type[+] = INFIX, and lbp[+] = rbp[+] = 14. 

The optional use of annotation patterns is a distinguishing feature of this 
meta-language. A pattern allows an operator to take multiple right arguments, each labelled 
with an identifying token. In addition, tokens may be included which label no argument 
but play a purely syntactic role. 

Definition: An annotation pattern, or simply pattern, is an expression specifying possible 
labelled argument configurations. We use the meta-variables p, q, and r to 
represent patterns. A pattern p must be in one of the following forms: 

1. λ 

2. "d" where d is a token 

3. "d" ~ where d is a token, ~ a meta-symbol as above 

or, inductively, for some annotation patterns q and r, and the 
associated sets first_q, first_r and cont_q: 

4. qr if cont_q ∩ first_r = ∅ 

5. (q|r) if first_q ∩ first_r = ∅ 

6. (q)* if cont_q ∩ first_q = ∅ 

Definition: A delimiter is a token used in a pattern. We use the meta-variable d (possibly 
subscripted) to represent a delimiter. 

Before defining the sets first and cont, we refer briefly to Example 2 of Section III.A: 

Ex.2 "IF" ~ "THEN" ~ ("ELSE" ~ | λ) ,6;<denotation> 

In this example the operator IF is defined with type[IF] = PREFIX, and we have the 
pattern p[IF] = "THEN" ~ ("ELSE" ~ | λ) (which will be seen to satisfy the 
restrictions). As in the operator part of a production, quotes enclose the tokens, in this case 
delimiters, of the defined language, and ~ holds the place of an argument. 

With the exception of the restrictions imposed on cases 4, 5, and 6, these patterns are 
ordinary regular expressions with the usual interpretation; the symbols λ, |, and * denote 
the empty string, pattern alternation, and pattern star closure respectively. 

Although the symbol ~ is intended to hold the place of an argument (a 
subexpression) we will expedite our discussion of patterns by considering a language in 
which we include the symbol ~ to match itself. Thus, we will say that the string d matches 
the pattern "d", and the string d ~ matches the pattern "d" ~. Two strings which match 
p[IF] are THEN ~ and THEN ~ ELSE ~. 

Notation: When the string ω matches the pattern p, we write ω∈p. 

Recalling the restrictions imposed in our definition of annotation patterns, we now 
define the sets first and cont. We begin by defining our notion of first. 

Definition: first(ω) = the first symbol of the string ω (undefined if ω = λ). 

Definition: first_p = ∪_{ω∈p} {first(ω)}. 

The set first_p is simply the generalization of first(ω) to all strings matching p. Similarly, 
we have two forms of cont. 

Definition: If ω∈p then cont_p(ω) = ∪_{ωβ∈p} {first(β)}. 

Definition: cont_p = ∪_{ω∈p} cont_p(ω). 

The set cont_p(ω) includes any symbol which may follow ω in a longer string, when 
both ω and the longer string match p. In the context of finding a string to match a 
particular pattern, this set has the following interpretation. Assume you are scanning a 
string from left to right and have just reached the end of a string ω which matches the 
pattern p. If any of the symbols of cont_p(ω) occur next in the string, then it may be 
possible to continue scanning and find an extension of ω which also matches p. Referring 
again to Example 2, we have cont_p[IF](THEN ~) = {ELSE} and 
cont_p[IF](THEN ~ ELSE ~) = ∅. The set cont_p is the generalization to include all tokens 
which might occur this way, so we have cont_p[IF] = {ELSE}. 
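
The two worked values above can be checked directly from the definitions. The Python 
sketch below enumerates the strings matching a pattern up to a length bound and reads first 
and cont off that finite set. The nested-tuple encoding, the function names, and the bound 
are assumptions of this illustration, and for patterns containing * the result is exact only 
up to the chosen bound. 

```python
# Pattern encoding assumed for this sketch:
#   ("lam",) | ("tok", d, has_arg) | ("cat", q, r) | ("alt", q, r) | ("star", q)
# As in the text, "~" is treated as a symbol that matches itself.

def strings(p, limit):
    """All strings (tuples of symbols) matching p with at most `limit` symbols."""
    kind = p[0]
    if kind == "lam":
        return {()}
    if kind == "tok":
        w = (p[1], "~") if p[2] else (p[1],)
        return {w} if len(w) <= limit else set()
    if kind == "alt":
        return strings(p[1], limit) | strings(p[2], limit)
    if kind == "cat":
        return {u + v
                for u in strings(p[1], limit)
                for v in strings(p[2], limit - len(u))}
    # star: keep appending non-empty instances of q until nothing new fits
    result, frontier = {()}, {()}
    while frontier:
        new = {u + v
               for u in frontier
               for v in strings(p[1], limit - len(u)) if v} - result
        result |= new
        frontier = new
    return result

def first_set(p, limit=8):
    """first_p: the first symbols of the non-empty strings matching p."""
    return {w[0] for w in strings(p, limit) if w}

def cont_of(p, w, limit=8):
    """cont_p(w): symbols that follow w in some longer string matching p."""
    n = len(w)
    return {x[n] for x in strings(p, limit) if len(x) > n and x[:n] == w}

def cont_set(p, limit=8):
    """cont_p: the union of cont_p(w) over all strings w matching p."""
    return set().union(*(cont_of(p, w, limit) for w in strings(p, limit)))

# The pattern of Example 2: "THEN" ~ ("ELSE" ~ | λ)
P_IF = ("cat", ("tok", "THEN", True), ("alt", ("tok", "ELSE", True), ("lam",)))
```

For P_IF the enumeration yields exactly the two strings THEN ~ and THEN ~ ELSE ~, so 
cont_of(P_IF, ("THEN", "~")) is {"ELSE"} and cont_set(P_IF) is {"ELSE"}, in agreement with 
the values stated for Example 2. 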

The sets first and cont enable us to state important restrictions on the use of 
patterns, restrictions which are directly motivated by our parsing algorithm for matching 
strings to patterns. While we can not in general prevent non-local interaction of annotation 
patterns (e.g., nested IF-THEN-ELSE expressions), it is possible to insure that there are no 
ambiguities or unexpected results relative to a single pattern. The three restrictions prevent 
any such problems. 

The essence of the matching algorithm is as follows: look at the part of the pattern 
remaining to be matched and decide what to do next. If the pattern is λ, then simply stop. 
If it is "d", then look at the next symbol in the input string. It must be d or there is an 
error. Likewise, "d" ~ means to check for d and afterwards "collect an argument". When 
the pattern is the alternation (q | r), there is the obvious problem of choosing which 
pattern to use. The decision is made by examining the next symbol and determining 
whether it is in the set first_q or first_r. When the pattern is qr, the two patterns are simply 
matched in order. Finally, when the pattern is (q)*, the next symbol is always checked for 
membership in first_q. If true, the pattern q is matched and the process repeated. 
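
The description above translates almost line for line into code. The following Python 
sketch is one possible rendering, not the LISP program referred to in Chapter III: the 
nested-tuple pattern encoding, the function names, and the convention that an already 
collected argument appears in the input as the symbol ~ are all assumptions of this example. 

```python
# Pattern encoding assumed for this sketch:
#   ("lam",) | ("tok", d, has_arg) | ("cat", q, r) | ("alt", q, r) | ("star", q)

def nullable(p):
    """True if the pattern matches the empty string."""
    kind = p[0]
    if kind == "lam":
        return True
    if kind == "tok":
        return False
    if kind == "cat":
        return nullable(p[1]) and nullable(p[2])
    if kind == "alt":
        return nullable(p[1]) or nullable(p[2])
    return True                                   # star

def first(p):
    """The set of symbols that can begin a string matching p."""
    kind = p[0]
    if kind == "lam":
        return set()
    if kind == "tok":
        return {p[1]}
    if kind == "cat":
        return first(p[1]) | (first(p[2]) if nullable(p[1]) else set())
    if kind == "alt":
        return first(p[1]) | first(p[2])
    return first(p[1])                            # star

def expect(s, i, symbol):
    """Consume `symbol` at position i or report an error."""
    if i >= len(s) or s[i] != symbol:
        raise SyntaxError(f"expected {symbol!r} at position {i}")
    return i + 1

def match(p, s, i=0):
    """Match p against s starting at i; return the position after the match.

    Only s[i], the next symbol, is inspected before each decision, and
    the scan never backs up.
    """
    kind = p[0]
    if kind == "lam":
        return i                                  # simply stop
    if kind == "tok":
        i = expect(s, i, p[1])                    # must be d, else an error
        return expect(s, i, "~") if p[2] else i   # then "collect an argument"
    if kind == "cat":
        return match(p[2], s, match(p[1], s, i))  # match q, then r, in order
    nxt = s[i] if i < len(s) else None
    if kind == "alt":
        for branch in (p[1], p[2]):
            if nxt in first(branch):
                return match(branch, s, i)
        if nullable(p):
            return i                              # λ is the lowest priority choice
        raise SyntaxError(f"unexpected {nxt!r} at position {i}")
    while nxt in first(p[1]):                     # star: repeat while q can start
        i = match(p[1], s, i)
        nxt = s[i] if i < len(s) else None
    return i

EX2 = ("cat", ("tok", "THEN", True), ("alt", ("tok", "ELSE", True), ("lam",)))
EX4 = ("cat", ("alt", ("tok", "BAR", False), ("lam",)), ("tok", "BAR", False))
```

On the input THEN ~ ELSE ~ the pattern of Example 2 is matched through the ELSE argument. 
The pattern of Example 4 fails on an input holding a single BAR, because the scan has 
already taken that BAR for the optional part and cannot back up. 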

The restrictions on patterns insure that the choices made by this method are always 
unique; i.e. that they are the only possible choices. Thus, in the case of (q | r) we require 
that first_q ∩ first_r = ∅; no symbol may be in both sets. The problem with qr is slightly 
more subtle; the restriction here (cont_q ∩ first_r = ∅) insures that the choice, whether to 
continue matching a longer string to q, or to stop and begin matching r, is always unique. 
The * operator is essentially an extension of concatenation, so the restriction on the pattern 
(q)* is similar. It must always be clear whether to continue matching an instance of q, or 
to go on to the next, so we require that cont_q ∩ first_q = ∅. Important properties of these 
restrictions, independent of any parsing algorithm, are proven in Section V.B. 
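
Because the three restrictions are stated in terms of first and cont, they can be checked 
by a single recursive pass over a pattern. The sketch below is an illustration in Python, not 
the verification aid of an actual definitional facility: the class names, the function names, 
and the structural equations used for cont are assumptions of this example, and those 
equations are intended to agree with the definitions only where the subpatterns involved 
already satisfy the restrictions. 

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lam:                      # the null string λ
    pass

@dataclass(frozen=True)
class Tok:                      # "d" ~ (or "d" alone when arg is False)
    d: str
    arg: bool = True

@dataclass(frozen=True)
class Cat:                      # qr
    q: object
    r: object

@dataclass(frozen=True)
class Alt:                      # (q|r)
    q: object
    r: object

@dataclass(frozen=True)
class Star:                     # (q)*
    q: object

def nullable(p):
    if isinstance(p, Tok):
        return False
    if isinstance(p, Cat):
        return nullable(p.q) and nullable(p.r)
    if isinstance(p, Alt):
        return nullable(p.q) or nullable(p.r)
    return True                 # Lam and Star

def first(p):
    if isinstance(p, Lam):
        return set()
    if isinstance(p, Tok):
        return {p.d}
    if isinstance(p, Cat):
        return first(p.q) | (first(p.r) if nullable(p.q) else set())
    if isinstance(p, Alt):
        return first(p.q) | first(p.r)
    return first(p.q)           # Star

def cont(p):
    if isinstance(p, (Lam, Tok)):
        return set()
    if isinstance(p, Cat):
        # If r may be empty, a match can end inside q and then go on.
        after_q = (cont(p.q) | first(p.r)) if nullable(p.r) else set()
        return cont(p.r) | after_q
    if isinstance(p, Alt):
        via_empty = set()
        if nullable(p.q):
            via_empty |= first(p.r)
        if nullable(p.r):
            via_empty |= first(p.q)
        return cont(p.q) | cont(p.r) | via_empty
    return cont(p.q) | first(p.q)   # Star

def violations(p):
    """List the subpatterns of p that break restriction 4, 5, or 6."""
    if isinstance(p, (Lam, Tok)):
        return []
    if isinstance(p, Star):
        clash = cont(p.q) & first(p.q)
        return violations(p.q) + ([("star", clash)] if clash else [])
    found = violations(p.q) + violations(p.r)
    if isinstance(p, Cat):
        clash = cont(p.q) & first(p.r)
        return found + ([("concatenation", clash)] if clash else [])
    clash = first(p.q) & first(p.r)
    return found + ([("alternation", clash)] if clash else [])

EX2 = Cat(Tok("THEN"), Alt(Tok("ELSE"), Lam()))
EX4 = Cat(Alt(Tok("BAR", arg=False), Lam()), Tok("BAR", arg=False))
EX4_PRIME = Cat(Tok("BAR", arg=False), Alt(Tok("BAR", arg=False), Lam()))
```

With these definitions violations(EX2) and violations(EX4_PRIME) are empty, while 
violations(EX4) reports a concatenation clash on BAR, which is the situation the third 
informal rule of Section III.A was introduced to exclude. 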

Sets of Productions 

We have now defined the local properties of a meta-language production; there is no 
other form of definitional information. A complete language definition is any set of 
productions, defining a set of operators, which satisfies minor global restrictions (to insure 
that all properties are well-defined). 

Definition : An operator is of type NUL-TYPE if it is defined without a left argument. 

An operator is of type LEF-TYPE if it is defined with a left argument. 

Definition : A language definition D is a set of productions in which: 

1) no token OP has more than one NUL-TYPE production, 

2) no token OP has more than one LEF-TYPE production, and 

3) no token is both a LEF-TYPE operator and a delimiter. 

Conditions 1 and 2 allow a token to represent two operators in the special case where 
one operator takes a left argument and the other does not; i.e., when there will be no 
ambiguity. In this case, two separate operations are actually being defined, but they are 
represented by the same symbol. Such a token is both LEF-TYPE and NUL-TYPE. Context, 
i.e. the presence of the left argument, will always make it clear which operator is meant. 
Condition 3 guarantees that the left binding power of every token is well defined, since the 
parser uses the convention that all delimiters have lbp = 0. 
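
The three global conditions are simple enough to check mechanically. The following 
Python sketch shows one way a definitional facility might do so; the Production record, its 
field names, and the function check_definition are hypothetical, and only the fields needed 
for the three conditions are modelled (binding powers and denotations are omitted). 

```python
from dataclasses import dataclass
from typing import FrozenSet, List

NUL_TYPES = {"NILFIX", "PREFIX"}     # defined without a left argument
LEF_TYPES = {"POSTFIX", "INFIX"}     # defined with a left argument

@dataclass(frozen=True)
class Production:
    token: str
    type: str                                   # one of the four operator types
    delimiters: FrozenSet[str] = frozenset()    # tokens used in the pattern

def check_definition(productions: List[Production]) -> List[str]:
    """Return a message for each violated condition; empty if D is acceptable."""
    problems = []
    nul = [p.token for p in productions if p.type in NUL_TYPES]
    lef = [p.token for p in productions if p.type in LEF_TYPES]
    for tok in sorted(set(nul)):
        if nul.count(tok) > 1:
            problems.append(f"condition 1: {tok} has more than one NUL-TYPE production")
    for tok in sorted(set(lef)):
        if lef.count(tok) > 1:
            problems.append(f"condition 2: {tok} has more than one LEF-TYPE production")
    all_delimiters = set().union(*(p.delimiters for p in productions))
    for tok in sorted(set(lef) & all_delimiters):
        problems.append(f"condition 3: {tok} is both a LEF-TYPE operator and a delimiter")
    return problems
```

A definition holding one PREFIX and one INFIX production for the minus sign passes all 
three checks, since the two productions fall into different classes. 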

IV.B Generative Model 

We have presented in Section IV.A the structure of our meta-language. A generative 
model is now defined which determines the correspondence between a language definition D 
(a set of productions) and the language defined by D (a set of token strings). The model is 
closely related to the assumptions on which the CGOL approach to translator writing is 
based: the argument relationship among operators and the syntactic conventions for linear 
representation are related but separate issues. 

We begin with the set E_D of abstract expressions, collections of operators with 
specified argument relationships. We then define three properties of expressions which 
measure potential for syntactic interaction. Given these properties, we define the subset 
E'_D ⊆ E_D of expressions which are grammatical; i.e., may be unambiguously represented as 
linear strings of tokens. The process of linear representation is defined as the function U_D, 
mapping expressions into the set S* of strings of tokens. 

Expression Trees 

Our basic notion of abstract expression is based on the relationship "is an argument 
of" among operators. This notion is extended by ordering and labelling each instance of 
the relationship, identifying the particular role being played by the argument. Thus, an 
instance of the relationship might be "is a left argument of" or "is an ELSE argument of". 

Our formal model of these expressions is a set of ordered trees with labels on both 
nodes and branches. A node corresponds to an operator whose arguments (subtrees attached 
by ordered, labelled branches) occur in a configuration appropriate to the definition of the 
operator. Examples of these trees are given in Figure 1. Figure 1a is an expression tree 
containing only arithmetic operators. Arguments here are labelled "left" and "right", 
indicating their roles. Figure 1b shows a conditional expression in which the test is the 
"right" argument and the alternative values are appropriately labelled. Figures 1c and 1d 
illustrate possible uses of delimiters which label no arguments. In these cases the tokens ) 
and FI are included to signal the end of the expression. We formalize this latter technique 
by permitting labelled branches which connect to the null subtree, although we will not 
include the null tree as part of our set. 

We now define formally the set of expressions corresponding to a meta-language 
definition. Our basic requirement is that the argument configuration for each operator be 
appropriate to its definition. This requires a more precise definition of the correspondence 
between patterns and sequences of subtrees. 

Definition : The ordered subtrees e_1,...,e_n (n≥0), labelled d_1,...,d_n, match the 
pattern p iff one of the following is true: 

1. p = λ and n=0, i.e. there are no subtrees. 

2. p = "d" and n=1, where e_1 is null and d=d_1. 

3. p = "d" ~ and n=1, where e_1 is non-null and d=d_1, 

or, where q and r are patterns, one of the following: 

4. p = qr and ∃k, 0≤k≤n, such that e_1,...,e_k match q, and 
e_{k+1},...,e_n match r. 

5. p = (q|r) and e_1,...,e_n match q or r. 

6. p = (q)* and either n=0 or ∃k, 0<k≤n, such that e_1,...,e_k match 
q, and e_{k+1},...,e_n match (q)*. 

Figure 1. Expression trees 

We are now ready to define our complete set of expression trees. 

Definition: The set of expression trees E_D corresponding to a language definition D 
contains the set of finite trees defined inductively by: 

Basis: e∈E_D, where e is a single node with no branches attached, iff the node 
has a label op such that op is defined in D, type[op]=NILFIX, and 
λ∈p[op]. 

Induction: e∈E_D, where e is a tree with subtrees attached by labelled 
branches, iff the root node has a label op such that op is defined in D, 
each non-null subtree is in E_D, and one of the following cases holds: 

1. type[op]=NILFIX and e has subtrees e_1,...,e_n (n≥0), labelled 
d_1,...,d_n, which match p[op]. 

2. type[op]=PREFIX and e has subtrees e_right,e_1,...,e_n (n≥0), 
labelled right,d_1,...,d_n, where e_1,...,e_n match p[op]. 

3. type[op]=POSTFIX and e has subtrees e_left,e_1,...,e_n (n≥0), 
labelled left,d_1,...,d_n, where e_1,...,e_n match p[op]. 

4. type[op]=INFIX and e has subtrees e_left,e_right,e_1,...,e_n (n≥0), 
labelled left,right,d_1,...,d_n, where e_1,...,e_n match 
p[op]. 
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Syntactic Properties of Expressions 

Having defined our abstract domain of expressions, we now apply syntactic 
conventions. We claimed in the previous chapter that there are only two basic types of 
syntactic interaction possible among expressions in linear form. We define three properties 
of expressions (r-index, l-index, and c-set) which explicitly measure the tendency for 
an expression to participate in such interactions. 

The first and most common form of syntactic interaction is the association of 
intervening subexpressions (arguments). For example, in the expression a+b*c there is a 
choice, governed by convention, for the association of the subexpression b; it may either 
associate to the left (and become an argument to +) or to the right (as an argument to *). 
Since operators are subject to this interaction on either side (and binding powers may differ 
from right to left), we define two corresponding properties, beginning with the left. 

Definition: If e ∈ E_D then l-index(e) is defined inductively as follows: 

Basis: If e is a node with no branches attached, then 

l-index(e) = ∞. 

Induction: If e has subtrees then let op be the label of the root node: 

a) if op is of type LEF-TYPE (i.e. if it has a left argument), then 
l-index(e) = min[lbp[op], l-index(e_left)], 

b) otherwise (if no left argument) 

l-index(e) = ∞. 

The value of l-index is a numerical measurement of an expression's resistance to 
false association to the left. If an operator has no left argument at all, then there can never 
be an intervening expression on the left, so there can never be a problem. In this case, 
l-index is ∞. Since a left argument may itself be an expression with a left argument, and 
so on, this property is defined inductively over all such subexpressions. An expression's 
resistance is only as high as the weakest "exposed" operator. 

For example, if e is the expression shown in Figure 1a, there are two operators 
exposed to the left, + and *. By definition we know that 

l-index[e] = min[lbp[+], lbp[*], lbp[A]], 

but since lbp[A] = ∞ this is equal to min[lbp[+], lbp[*]]. From this we understand 
that we have two subexpressions, A and A*B, which might be falsely associated to the left. 
In an expression tree these exposed operators are those which may be reached from the top 
by following branches labelled left down the tree. 

The situation on the right side of expressions is analogous although complicated 
slightly by the presence of multiple right arguments. 

Definition: If e ∈ E_D then r-index(e) is defined inductively as follows: 

Basis: If e is a node with no branches attached, then 

r-index(e) = ∞. 

Induction: If e has subtrees, then let op be the label of the root node: 

a) if there is a subtree e_n and if it is non-null, then 
r-index(e) = min[rbp[op], r-index(e_n)], 

b) otherwise 

r-index(e) = ∞. 

The value of r-index is analogous to l-index except that we now refer to e_n 
instead of e_left. When there are subtrees e_1,...,e_n which match the pattern p[op] (i.e., 
when n>0), then e_n is simply the last (rightmost) one. It is this subexpression which, if 
non-null, is exposed to the right and is subject to false association. For example, both 1 
and X+1 are exposed to the right in the expression of Figure 1b. If e_n is null, then its label 
d_n is being used as a purely syntactic token to indicate that there are no more arguments to 
the right. In this case there is no possibility of false association, so the value of r-index 
is ∞. Examples of this are the expressions of Figures 1c and 1d. Now if the annotation part 
of the expression is entirely null (i.e. n=0), then the expression is of the ordinary arithmetic 
variety (e.g., Figure 1a). In this case, e_n refers to the right argument e_0, if there is one, 
and r-index is the exact counterpart of l-index. 
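The two index definitions translate directly into recursive functions. The Python sketch below is illustrative only. The Tree class and the binding powers in LBP and RBP are assumptions made for the example, not values taken from any actual language definition; the right argument e_0 is stored in the field right, and the annotation subtrees e_1,...,e_n in args.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tree:
    op: str
    left: Optional["Tree"] = None       # e_left
    right: Optional["Tree"] = None      # e_0, the right argument
    # annotation subtrees as (d_i, e_i) pairs; e_i is None for a null subtree
    args: List[Tuple[str, Optional["Tree"]]] = field(default_factory=list)


# Hypothetical binding powers, chosen only for illustration.
LBP = {"+": 10, "*": 20}
RBP = {"+": 10, "*": 20, "IF": 5, "(": 0}


def l_index(e):
    """Resistance to false association on the left."""
    if e.left is None:                  # basis, or no left argument
        return math.inf
    return min(LBP[e.op], l_index(e.left))


def r_index(e):
    """Resistance to false association on the right."""
    # e_n is the last annotation subtree if n > 0, else the right argument.
    last = e.args[-1][1] if e.args else e.right
    if last is None:                    # basis, or e_n null or absent
        return math.inf
    return min(RBP[e.op], r_index(last))


# A*B+C: both + and * are exposed on the left, only + on the right.
e = Tree("+", Tree("*", Tree("A"), Tree("B")), Tree("C"))
print(l_index(e), r_index(e))
```

Closing an expression with a null-labelled delimiter, as in a parenthesized expression, gives it an r-index of ∞ in this sketch, which matches the remark about Figures 1c and 1d.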

We turn now to the second type of syntactic interaction, the generalized dangling 
ELSE problem. We recall that our pattern matching algorithm (i.e. for collecting right 
arguments) will continue to gather arguments as long as possible. We are interested in the 
case where the pattern p has been matched (say by the string w) and there is a choice 
whether to continue. Any token for which this is possible is by definition in cont_p(w). 
Looking at our standard example where p[IF] = "THEN" ~ ("ELSE" ~ | λ), we have: 

cont_{p[IF]}("THEN" ~) = {ELSE}. 

This tells us that if the operator IF has so far collected the token THEN and a following 
argument then the collection may stop, but if ELSE appears next in the input string, it will 
be included. When we deal with general expression trees, this problem can be caused either 
at the top level (by the pattern of the topmost operator) or at lower levels (in exposed 
rightmost arguments), so the property c-set is defined recursively, similar to r-index. 
The c-set of an expression is the set of all delimiters which would be incorrectly included 
if placed after the expression in linear form. 

This definition requires the property cont to be defined on an ordered set of subtrees, 
rather than on the strings of the original definition. The correspondence is quite 
straightforward: a null subtree e_i with branch labelled d_i corresponds to the single symbol 
d_i, and a non-null subtree e_i labelled d_i corresponds to the string d_i ~. As is proven in 
Lemma 11 of Section V.B, this translation does not affect the definition of cont; the symbol 
~ can never be in the set. 
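One way to compute cont on a sequence of labelled subtrees is with pattern derivatives (Brzozowski's construction). This is a technique swapped in for illustration, not the method used in this report. In the sketch below each subtree is reduced to a pair (label, has_arg), patterns are nested tuples in the style of the internal representation of Section IV.C, and the EMPTY pattern and the helper names conc, union, deriv, and cont are assumptions of the sketch.

```python
LAMB = ("LAMB",)
EMPTY = ("EMPTY",)      # matches nothing; not part of the pattern language


def nullable(p):
    k = p[0]
    if k in ("LAMB", "STAR"):
        return True
    if k == "CONC":
        return nullable(p[1]) and nullable(p[2])
    if k == "UNION":
        return nullable(p[1]) or nullable(p[2])
    return False                        # token patterns and EMPTY


def first(p):
    k = p[0]
    if k in ("LAMB", "EMPTY"):
        return set()
    if k == "CONC":
        return first(p[1]) | (first(p[2]) if nullable(p[1]) else set())
    if k == "UNION":
        return first(p[1]) | first(p[2])
    if k == "STAR":
        return first(p[1])
    return {k}


def conc(q, r):                         # concatenation, simplifying EMPTY, LAMB
    if EMPTY in (q, r):
        return EMPTY
    return r if q == LAMB else ("CONC", q, r)


def union(q, r):                        # alternation, simplifying EMPTY
    if q == EMPTY:
        return r
    return q if r == EMPTY else ("UNION", q, r)


def deriv(p, item):
    """Pattern matched by what may follow item, given that item begins a match of p."""
    label, has_arg = item
    k = p[0]
    if k in ("LAMB", "EMPTY"):
        return EMPTY
    if k == "CONC":
        d = conc(deriv(p[1], item), p[2])
        return union(d, deriv(p[2], item)) if nullable(p[1]) else d
    if k == "UNION":
        return union(deriv(p[1], item), deriv(p[2], item))
    if k == "STAR":
        return conc(deriv(p[1], item), p)
    same = k == label and (len(p) == 2) == has_arg
    return LAMB if same else EMPTY


def cont(p, w):
    """Labels that can continue w, a list of (label, has_arg) pairs, under p."""
    for item in w:
        p = deriv(p, item)
    return first(p)


P_IF = ("CONC", ("THEN", "ARG"), ("UNION", ("ELSE", "ARG"), LAMB))
print(cont(P_IF, [("THEN", True)]))     # the standard IF example
```

Because only labels are ever collected, the symbol ~ cannot appear in the result, in line with the remark above.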
Definition: If e ∈ E_D then c-set(e) is defined inductively as follows: 

Basis: If e is a node with no branches attached, then 

c-set(e) = cont_{p[op]}(λ). 

Induction: If e has subtrees, then let op be the label of the root node. 

a) if there is a subtree e_n and if it is non-null, then 
c-set(e) = cont_{p[op]}(e_1,...,e_n) ∪ c-set(e_n). 

b) otherwise 

c-set(e) = cont_{p[op]}(e_1,...,e_n). 

Grammatical Expressions 

We now use these three syntactic properties to restrict our set E_D of expressions by 
eliminating those which permit unwanted syntactic interactions. 

Definition: e ∈ E_D is grammatical iff one of the following is true: 

Basis: e has no branches attached, or 

Induction: e is a tree with root node labelled op and with subtrees satisfying: 

0) each non-null subtree is grammatical, 

1) r-index(e_left) ≥ lbp[op], if there is a subtree e_left, 

2) rbp[op] < l-index(e_i), for 0≤i≤n, when e_i is non-null, 

3) d_i ∉ c-set(e_{i-1}), for 1≤i≤n, when e_{i-1} is non-null. 

This definition allows us to build trees while watching for syntactic problems. The 
restrictions correspond to the informal rules described in Section III.B; each restriction may 
be understood as the prevention of unwanted syntactic interaction. Restriction 1 covers the 
use of an expression as a left argument; it insures that the whole expression will be treated 
as the argument, not some exposed fragment. For example, this restriction would prevent 
the use of the expression in Fig. 1a as a left argument to the operator ↑, since the 
subexpression C would incorrectly become the left argument of ↑. Restriction 2 is the 
equivalent on the right side. Restriction 3 insures that no delimiter will be improperly 
included with a subexpression; e.g., don't use an IF-THEN expression as the THEN argument 
to an IF-THEN-ELSE expression. 

Definition: E'_D = {e ∈ E_D | e is grammatical}. 

Our defined language will be based on only the expression trees which are 
grammatical. Ungrammatical trees may be easily fixed by the addition of some operator 
with bracketing properties, typically the semantically null operator (. For example, the 
expression shown in Figure 1c would, given a reasonable definition, have lbp = rbp = 0 
and cont = ∅; i.e., it is syntactically secure. 

The Writing Function 

Now that we have eliminated syntactic problems from our set of expressions, we may 
use a trivial writing function. 

Definition: The writing function U_D is defined recursively on the set E_D as follows: 

If e ∈ E_D then: U_D(e) = α OP β d_1 γ_1 ... d_n γ_n where: 

OP is the token naming the operator at the root node of e. 

α = U_D(e_left) if e_left exists, λ otherwise. 

β = U_D(e_0) if e_0 exists, λ otherwise. 

d_i = the label on tree e_i for 1≤i≤n. 

γ_i = U_D(e_i) for 1≤i≤n, when e_i is non-null, λ otherwise. 

The linear representation of trees defined by U_D uses a very simple convention. An 
argument is preceded by its label, with two important exceptions: the labels left and 
right are implicitly represented by juxtaposition with the operator. 
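The writing function is short enough to sketch directly. The Python version below is an illustration under assumed names: the Tree class is hypothetical, with the right argument e_0 in the field right and the annotation subtrees in args, and write returns the linear form as a list of tokens.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tree:
    op: str
    left: Optional["Tree"] = None       # e_left
    right: Optional["Tree"] = None      # e_0, the right argument
    # annotation subtrees as (d_i, e_i) pairs; e_i is None for a null subtree
    args: List[Tuple[str, Optional["Tree"]]] = field(default_factory=list)


def write(e):
    """Linear form of e: alpha OP beta d_1 gamma_1 ... d_n gamma_n."""
    tokens = []
    if e.left is not None:              # alpha: left argument, by juxtaposition
        tokens += write(e.left)
    tokens.append(e.op)                 # OP
    if e.right is not None:             # beta: right argument, by juxtaposition
        tokens += write(e.right)
    for label, sub in e.args:           # each d_i followed by gamma_i
        tokens.append(label)
        if sub is not None:
            tokens += write(sub)
    return tokens


cond = Tree("IF", None, Tree("a"),
            [("THEN", Tree("b")), ("ELSE", Tree("c"))])
print(" ".join(write(cond)))
```

Note that the sketch writes any tree, grammatical or not; it is the restriction to grammatical trees that makes the linear form safe to parse back.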

The Defined Language 

Finally, the defined language S_D is simply the linear form of the grammatical trees. 

Definition: Given a set D of productions, the defined language is S_D, where 

S_D = U_D(E'_D). 

IV.C The Parsing Program 

We present the parsing program in two parts; in addition to the actual program 
(which we will view as a function from strings into expression trees) we give a specification 
of the internal representation required for meta-language productions. A program which 
automatically converts a meta-language production to this internal form is called the 
meta-language processor. 

The Meta-language Processor 

There is virtually no processing of the information given in the productions of the 
meta-language. It is simply broken into the natural categories, converted into a standard 
LISP representation, and attached to the property list of the defined token. The categories 
and their property list names are: 

1. Type (e.g. INFIX)          NUL-TYP, LEF-TYP 

2. Annotation Pattern         NUL-PAT, LEF-PAT 

3. Left binding power         LBP 

4. Right binding power        NUL-RBP, LEF-RBP 

5. Denotation                 NUL-DEN, LEF-DEN 

Since it is possible to have two operators for the same token, one with a left argument and 
one without, the two sets of data will be separately named so they may coexist and be 
independently retrieved from the property lists. The one exception is the left binding 
power, since it is irrelevant for NUL-TYPE operators. Any token used as a delimiter, 
however, will have its left binding power set to 0. The denotation properties will not be 
used in this implementation, since it will only parse and not translate. 

The definitional information will be represented as LISP data in the following forms: 

Type: The NUL-TYP and LEF-TYP properties are simply the appropriate names. Thus 
NUL-TYP may be either NILFIX or PREFIX, and LEF-TYP may be either 
POSTFIX or INFIX. 

Left binding power: The property is a non-negative integer. 

Right binding power: The property is a non-negative integer. 

Annotation Pattern: The representation of a pattern p is the list repr[p] defined 
recursively by: 

1. If p = λ then repr[p] = (LAMB). 

2. If p = "d" then repr[p] = (d), where d is the token. 

3. If p = "d" ~ then repr[p] = (d ARG), where d is the token, 
or, if q and r are patterns, and repr[q] and repr[r] their representations: 

4. If p = qr then repr[p] = (CONC repr[q] repr[r]). 

5. If p = (q|r) then repr[p] = (UNION repr[q] repr[r]). 

6. If p = (q)* then repr[p] = (STAR repr[q]). 

Since this information is on property lists, it is globally available to the parsing 
program; a request for one of these properties will have the same value independent of the 
particular environment from which it is made. For the purposes of proof, we give the 
following axioms which formally specify the operation of the meta-language processor. 

Axiom 1: If the token OP is defined in D as a nilfix operator, then 

(a) (GET 'OP 'NUL-TYP) = NILFIX 

(b) (GET 'OP 'NUL-PAT) = repr[p[op]] 

(c) (GET 'OP 'NUL-RBP) = rbp[op] 

Axiom 2: If the token OP is defined in D as a prefix operator, then 

(a) (GET 'OP 'NUL-TYP) = PREFIX 

(b) (GET 'OP 'NUL-PAT) = repr[p[op]] 

(c) (GET 'OP 'NUL-RBP) = rbp[op] 

Axiom 3: If the token OP is defined in D as a postfix operator, then 

(a) (GET 'OP 'LEF-TYP) = POSTFIX 

(b) (GET 'OP 'LEF-PAT) = repr[p[op]] 

(c) (GET 'OP 'LEF-RBP) = rbp[op] 

Axiom 4: If the token OP is defined in D as an infix operator, then 

(a) (GET 'OP 'LEF-TYP) = INFIX 

(b) (GET 'OP 'LEF-PAT) = repr[p[op]] 

(c) (GET 'OP 'LEF-RBP) = rbp[op] 

Axiom 5: If the token OP is used as a delimiter in any production in D, then 

(a) (GET 'OP 'LBP) = 0 

It may now be seen how our global restrictions on sets of productions insure that all 
of these properties are well-defined. Properties NUL-TYP, NUL-PAT, NUL-RBP, and 
NUL-DEN can only be determined if a nilfix or prefix operator is defined for OP, but we 
only allow one such production per token. Similarly, LEF-TYP, LEF-PAT, LEF-RBP, and 
LEF-DEN are well-defined. LBP may only be determined if a postfix or infix operator is 
defined for OP (in which case only one such definition is allowed) or if it is used anywhere 
as a delimiter (in which case the LBP is 0, no matter how many times it is used). A token 
may not, however, be both. 

The Parsing Program 

We present below the LISP code for a straightforward parser implementation. The 
parser returns the expression tree in a simple list representation defined below; an extension 
to the full translator would have the arguments passed to the denotation, rather than being 
assembled into a list. 

Expression Tree: The representation of a tree e ∈ E_D is the recursively defined list: 

repr[e] = (OP r_left r_right r_1 ... r_n) where 

r_left = (LEFT repr[e_left]) if e_left exists, otherwise non-existent 

r_right = (RIGHT repr[e_0]) if e_0 exists, otherwise non-existent 

r_i = (d_i repr[e_i]) if e_i is non-null 

r_i = (d_i) if e_i is null 

Several prominent features of this program should be kept in mind; it was written for 
perspicuity and convenience of proof. There are therefore no global variable references; 
for each subroutine the input stream is passed as an argument and returned as a value. 
The result is a program which is approximately twice as long and much less efficient than it 
could be. The main problem is that passing the input string as an argument often requires 
that the same expression be evaluated more than once. This problem could be easily solved 
but would result in rather more obscure code; efficiency has been sacrificed for clarity. An 
equivalent but efficient program could be proven correct by proving its equivalence to this 
one. Such a proof should be considerably shorter than an original proof of correctness as 
given here. 
The Basic Parsing Program 

(DEFUN PARSE (RBP STRING) 
  (ASSOC RBP (NUL-TYPE STRING))) 

(DEFUN ASSOC (RBP STATE) 
  (COND ((LESSP RBP (GET (CADR STATE) 'LBP)) 
         (ASSOC RBP (LEF-TYPE STATE))) 
        (T STATE))) 



This is the top level control structure of the parser. The function PARSE receives as 
input a right binding power and a list of symbols, the string in S_D to be parsed. The status 
of the parse is contained in the variable STATE which is passed and returned among the 
procedures. STATE is always a list whose first element is the representation of the 
expression (tree) parsed so far, and whose remaining elements are the unparsed input string. 
Given that an expression has been parsed, the function ASSOC (not the standard LISP 
function ASSOC) decides whether to give it as a left argument to the next operator in the 
string (by calling ASSOC recursively), or to return the current state. 

The function NUL-TYPE collects the arguments for the next operator in the string, on 
the assumption that it is nilfix or prefix. It in turn calls NILFIX or PREFIX to handle the 
separate cases. The function LEF-TYPE is similar, except that the expression parsed so far 
is assumed to be the left argument to the next operator in the string. The subroutine FIND 
handles the collection of all annotation tokens and arguments; it uses the functions 
LAMBDA-P (predicate for null string membership in a pattern) and FIRST (the set first 
previously defined). 
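For readers who want to experiment, the control structure just described can be transcribed into Python. This is a sketch under stated assumptions, not the program that is proven correct here. The tables NUL, LEF, and LBP are hypothetical stand-ins for the property lists, with binding powers chosen only for illustration. The state is a pair (tree, remaining tokens), ASSOC is written as a loop, and, as a further assumption of the sketch, association simply stops when no tokens remain.

```python
class ParseError(Exception):
    pass

# Hypothetical tables standing in for the property lists.
NUL = {"IF": ("PREFIX", 2, ("CONC", ("THEN", "ARG"),
                            ("UNION", ("ELSE", "ARG"), ("LAMB",)))),
       "(": ("PREFIX", 0, (")",))}
LEF = {"+": ("INFIX", 10, ("LAMB",)), "*": ("INFIX", 20, ("LAMB",))}
LBP = {"+": 10, "*": 20, "THEN": 0, "ELSE": 0, ")": 0}

def nullable(p):                                   # LAMBDA-P
    k = p[0]
    if k in ("LAMB", "STAR"):
        return True
    if k == "CONC":
        return nullable(p[1]) and nullable(p[2])
    if k == "UNION":
        return nullable(p[1]) or nullable(p[2])
    return False

def first(p):                                      # FIRST
    k = p[0]
    if k == "LAMB":
        return set()
    if k == "CONC":
        return first(p[1]) | (first(p[2]) if nullable(p[1]) else set())
    if k == "UNION":
        return first(p[1]) | first(p[2])
    if k == "STAR":
        return first(p[1])
    return {k}

def parse(rbp, tokens):                            # PARSE
    return assoc(rbp, nul_type(tokens))

def assoc(rbp, state):                             # ASSOC, as a loop
    tree, rest = state
    while rest:
        if rest[0] not in LBP:
            raise ParseError("no left binding power for " + rest[0])
        if not rbp < LBP[rest[0]]:
            break
        tree, rest = lef_type(tree, rest)
    return tree, rest

def nul_type(tokens):                              # NUL-TYPE, NILFIX, PREFIX
    if not tokens:
        raise ParseError("end of input")
    op, rest = tokens[0], tokens[1:]
    kind, rbp, pat = NUL.get(op, ("NILFIX", 0, ("LAMB",)))  # default: variable
    tree = [op]
    if kind == "PREFIX":
        arg, rest = parse(rbp, rest)
        tree.append(["RIGHT", arg])
    return find(rbp, (tree, rest), pat)

def lef_type(left, rest):                          # LEF-TYPE, POSTFIX, INFIX
    op, rest = rest[0], rest[1:]
    if op not in LEF:
        raise ParseError("no left definition for " + op)
    kind, rbp, pat = LEF[op]
    tree = [op, ["LEFT", left]]
    if kind == "INFIX":
        arg, rest = parse(rbp, rest)
        tree.append(["RIGHT", arg])
    return find(rbp, (tree, rest), pat)

def find(rbp, state, pat):                         # FIND
    tree, rest = state
    nxt = rest[0] if rest else None
    k = pat[0]
    if k == "LAMB":
        return state
    if k == "CONC":
        return find(rbp, find(rbp, state, pat[1]), pat[2])
    if k == "UNION":
        for alt in (pat[1], pat[2]):
            if nxt in first(alt):
                return find(rbp, state, alt)
        if nullable(pat):
            return state
        raise ParseError("neither alternative matches")
    if k == "STAR":
        if nxt in first(pat[1]):
            return find(rbp, find(rbp, state, pat[1]), pat)
        return state
    if nxt != k:
        raise ParseError("missing token " + k)
    if len(pat) == 1:                              # p = "d"
        return tree + [[k]], rest[1:]
    arg, rest = parse(rbp, rest[1:])               # p = "d" ~
    return tree + [[k, arg]], rest
```

With these tables, parse(0, "IF a THEN IF b THEN c ELSE d".split()) attaches the ELSE to the inner IF, which is the behaviour that restriction 3 on grammatical trees is designed to make unambiguous.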
Functions to Process NUL-TYPE Operators 

(DEFUN NUL-TYPE (STRING) 
  (COND ((NULL (CDR STRING)) ERROR)                     ;end of input 
        ((EQ (GET (CAR STRING) 'NUL-TYP) 'NILFIX) 
         (NILFIX (CAR STRING)                           ;operator 
                 (CDR STRING)                           ;unparsed string 
                 (GET (CAR STRING) 'NUL-RBP)            ;rbp[op] 
                 (GET (CAR STRING) 'NUL-PAT)))          ;p[op] 
        ((EQ (GET (CAR STRING) 'NUL-TYP) 'PREFIX) 
         (PREFIX (CAR STRING)                           ;as above 
                 (CDR STRING) 
                 (GET (CAR STRING) 'NUL-RBP) 
                 (GET (CAR STRING) 'NUL-PAT))) 
        (T 
         (NILFIX (CAR STRING)                           ;default case: 
                 (CDR STRING)                           ;variable or 
                 0                                      ;constant 
                 '(LAMB))))) 

(DEFUN NILFIX (OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (CAR (FIND RBP (CONS NIL REST) PAT))) 
        (CDR (FIND RBP (CONS NIL REST) PAT)))) 

(DEFUN PREFIX (OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (LIST (LIST 'RIGHT (CAR (PARSE RBP REST)))) 
                (CAR (FIND RBP 
                           (CONS NIL (CDR (PARSE RBP REST))) 
                           PAT))) 
        (CDR (FIND RBP (CONS NIL (CDR (PARSE RBP REST))) PAT)))) 
Functions to Process LEF-TYPE Operators 

(DEFUN LEF-TYPE (STATE) 
  (COND ((NULL (CDDR STATE)) ERROR)                     ;end of string 
        ((EQ (GET (CADR STATE) 'LEF-TYP) 'POSTFIX) 
         (POSTFIX (CAR STATE)                           ;left arg 
                  (CADR STATE)                          ;operator 
                  (CDDR STATE)                          ;unparsed string 
                  (GET (CADR STATE) 'LEF-RBP)           ;rbp[op] 
                  (GET (CADR STATE) 'LEF-PAT)))         ;p[op] 
        ((EQ (GET (CADR STATE) 'LEF-TYP) 'INFIX) 
         (INFIX (CAR STATE)                             ;as above 
                (CADR STATE) 
                (CDDR STATE) 
                (GET (CADR STATE) 'LEF-RBP) 
                (GET (CADR STATE) 'LEF-PAT))) 
        (T ERROR)))                                     ;no left def. 

(DEFUN POSTFIX (LVAL OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (LIST (LIST 'LEFT LVAL)) 
                (CAR (FIND RBP (CONS NIL REST) PAT))) 
        (CDR (FIND RBP (CONS NIL REST) PAT)))) 

(DEFUN INFIX (LVAL OPERATOR REST RBP PAT) 
  (CONS (APPEND (LIST OPERATOR) 
                (LIST (LIST 'LEFT LVAL)) 
                (LIST (LIST 'RIGHT (CAR (PARSE RBP REST)))) 
                (CAR (FIND RBP 
                           (CONS NIL (CDR (PARSE RBP REST))) 
                           PAT))) 
        (CDR (FIND RBP (CONS NIL (CDR (PARSE RBP REST))) PAT)))) 
Annotation Argument Processor 

(DEFUN FIND (RBP STATE PAT) 
  (COND ((EQ (CAR PAT) 'LAMB)                           ;p=λ 
         STATE) 
        ((EQ (CAR PAT) 'CONC)                           ;p=qr 
         (FIND RBP (FIND RBP STATE (CADR PAT)) (CADDR PAT))) 
        ((EQ (CAR PAT) 'UNION)                          ;p=(q|r) 
         (COND ((MEMBER (CADR STATE) (FIRST (CADR PAT))) 
                (FIND RBP STATE (CADR PAT))) 
               ((MEMBER (CADR STATE) (FIRST (CADDR PAT))) 
                (FIND RBP STATE (CADDR PAT))) 
               ((LAMBDA-P PAT) 
                STATE) 
               (T ERROR)))                              ;neither alternative matches 
        ((EQ (CAR PAT) 'STAR)                           ;p=(q)* 
         (COND ((MEMBER (CADR STATE) (FIRST (CADR PAT))) 
                (FIND RBP (FIND RBP STATE (CADR PAT)) PAT)) 
               (T STATE))) 
        ((AND (NULL (CDR PAT)) (EQ (CAR PAT) (CADR STATE)))   ;p="d" 
         (CONS (APPEND (CAR STATE) 
                       (LIST (LIST (CADR STATE)))) 
               (CDDR STATE))) 
        ((EQ (CAR PAT) (CADR STATE))                    ;p="d"~ 
         (CONS (APPEND (CAR STATE) 
                       (LIST (LIST (CADR STATE) 
                                   (CAR (PARSE RBP (CDDR STATE)))))) 
               (CDR (PARSE RBP (CDDR STATE))))) 
        (T ERROR)))                                     ;missing token (car pattern) 
Pattern Processing Functions 

(DEFUN LAMBDA-P (P) 
  (COND ((EQ (CAR P) 'LAMB) T)                          ;p=λ 
        ((EQ (CAR P) 'CONC)                             ;p=qr 
         (AND (LAMBDA-P (CADR P)) (LAMBDA-P (CADDR P)))) 
        ((EQ (CAR P) 'UNION)                            ;p=(q|r) 
         (OR (LAMBDA-P (CADR P)) (LAMBDA-P (CADDR P)))) 
        ((EQ (CAR P) 'STAR) T)                          ;p=(q)* 
        (T NIL)))                                       ;p="d" or "d"~ 

(DEFUN FIRST (P) 
  (COND ((EQ (CAR P) 'LAMB) NIL)                        ;p=λ 
        ((EQ (CAR P) 'CONC)                             ;p=qr 
         (APPEND (FIRST (CADR P)) 
                 (COND ((LAMBDA-P (CADR P)) 
                        (FIRST (CADDR P))) 
                       (T NIL)))) 
        ((EQ (CAR P) 'UNION)                            ;p=(q|r) 
         (APPEND (FIRST (CADR P)) (FIRST (CADDR P)))) 
        ((EQ (CAR P) 'STAR) (FIRST (CADR P)))           ;p=(q)* 
        (T (LIST (CAR P)))))                            ;p="d" or "d"~ 
V. CORRECTNESS 

Using the definitions presented in Chapter IV, we are now prepared to formally state 
and prove the notion of correctness discussed informally in Section III.D. In the first section 
of this chapter we state our main result, the PARSE theorem, and discuss three important 
corollaries which embody more closely our intuitive notions of correctness. Section V.B 
presents a number of preliminary lemmas, dealing primarily with properties of annotation 
patterns in our metalanguage. These results are theoretical properties and are completely 
independent of the parsing algorithm. Sections V.C and V.D contain the proofs of parts I 
and II, respectively, of the PARSE theorem. These proofs are straightforward 
since the interesting theoretical results are separately proven. 



V.A Formal Statement 

We begin a formal statement of correctness by recalling the user-oriented description 
of a defined language. For any set of metalanguage productions D, the language S_D ⊆ Σ* 
defined by D is S_D = U_D(E'_D), where U_D is the writing function and E'_D is the set of 
grammatical expression trees. The parser for the language, constructed by the algorithm of 
Section IV.C, is represented by the function P_D. This function maps strings of Σ* into 
expression trees (defined in IV.C). The function P_D is partial; when we write P_D(s) = e, 
we mean that the parser, when given the input string s, halts error-free and returns e. We 
state now our main result. 

PARSE THEOREM: 

I. ∀D ∀e∈E'_D (P_D(U_D(e)) = e) 

II. ∀D ∀s∈Σ* (P_D(s) halts error-free ⇒ s ∈ S_D) 

For the rest of this chapter we assume that D refers to some language definition expressed 
in the meta-language of Section IV.A; i.e., we drop the "for all D". 
We examine now the sufficiency of the result relative to general notions of 
correctness in the form of corollaries. The first is that a translator should be an acceptor for 
the language S_D in the ordinary sense; the translator should halt error-free exactly when 
given a sentence in the language S_D. 

Corollary 1 (Acceptor): 

∀s∈Σ* (P_D(s) halts error-free ⇔ s ∈ S_D) 

Proof: One direction simply restates part II of the PARSE theorem. Now assume s ∈ S_D. 
By definition there is some e ∈ E'_D such that s = U_D(e). Part I says 
P_D(U_D(e)) = P_D(s) = e, i.e., PARSE halts error-free. ∎ 

We also expect that the translator, when it halts error-free, returns a valid parse of 
the input string. 

Corollary 2 (Parser): 

∀s∈S_D (P_D(s) ∈ E'_D ∧ U_D(P_D(s)) = s) 

Proof: Assume s ∈ S_D. Then there is some e ∈ E'_D such that s = U_D(e). By Part I we 
know P_D(s) = P_D(U_D(e)) = e ∈ E'_D. Furthermore, since P_D(s) = e, we have 
U_D(P_D(s)) = U_D(e) = s. ∎ 

We note that Corollary 2 only guarantees the output of some valid expression tree, or 
parse, for each input. We have not proven that such a parse must be unique, i.e., that the 
language is unambiguous. Ambiguity is a property of a language and its means of 
definition, not of a particular parsing scheme. 

Corollary 3 (Uniqueness): 

∀e,e'∈E'_D (U_D(e) = U_D(e') ⇒ e = e') 

Proof: Assume U_D(e) = U_D(e') for e,e' ∈ E'_D. Since the parser is a function, 
P_D(U_D(e)) = P_D(U_D(e')). Then by Part I we have e = P_D(U_D(e)) 
= P_D(U_D(e')) = e'. ∎ 

Although not strictly a property of the parser, we treat this property here for completeness 
and convenience of proof. 

V.B Preliminary Lemmas 

This section formally states and proves a number of necessary properties of our 
definitional system. Some are merely restatements of definitions and are included for 
uniform reference; the majority are derived properties which are essential to the program 
proof. The final two lemmas are correctness proofs of two simple utility programs, 
LAMBDA-P and FIRST. 

We begin with binding powers. 

Lemma 1 (Binding powers): 

(a) If the token op is defined as an operator in D, then rbp[op] ≥ 0 and lbp[op] ≥ 0, 
if defined. 

(b) If the token d is used as a delimiter in D, then lbp[d] = 0. 

(c) For any e ∈ E_D, l-index[e] ≥ 0 and r-index[e] ≥ 0. 

Proof: Parts (a) and (b) are immediate from the definitions. Part (c) uses part (a) and 
follows by trivial induction over the definitions of l-index and r-index. ∎ 

The following lemmas describe properties of annotation patterns. Although patterns 
ultimately determine sequences of labelled subtrees, these properties will be stated and 
proven in terms of a simpler but equivalent language. We say that a pattern may be 
matched by strings of symbols, where the symbols include the special symbol ~ and tokens 
of the defined language. The same convention was used in the discussion of first and cont 
in Section IV.A. The correspondence between the strings used here and the ordered sets of 
labelled subtrees is straightforward. The symbol ~ can only follow a token in strings which 
match patterns. A token d followed by ~ in one of these strings corresponds to a non-null 
subtree labelled d. A token d not followed by ~ corresponds to a null subtree labelled d. 
Lemmas 4 and 11 guarantee that the symbol ~ is invisible; i.e., it plays no role in any of the 
results presented here. The results apply equally to sequences of labelled subtrees. 

For convenience we restate here an essential feature of the definition of patterns, the 
restrictions on the inductive use of pattern concatenation, alternation, and star closure. 
Restrictions (Definition of patterns): Let p, q, r be patterns. 

R1. If p = qr then cont_q ∩ first_r = ∅. 

R2. If p = (q|r) then first_q ∩ first_r = ∅. 

R3. If p = (q)* then cont_q ∩ first_q = ∅. 

Because our parsing algorithm continually requires us to treat λ as a special case, we 
would like to know some of the null-string properties of patterns. 

Lemma 2 (λ predicate): Let p, q, r be patterns. 

(a) If p = qr then λ matches p iff λ matches q ∧ λ matches r. 

(b) If p = (q|r) then λ matches p iff λ matches q ∨ λ matches r. 

(c) If p = (q)* then λ matches p. 

Proof: Immediate from the definition of match. ∎ 

Lemma 2 is the basis for the algorithm used by LAMBDA-P, which calculates whether or not 
λ matches a particular pattern. 

The next lemma is relevant to the computation of the set first for a pattern. 

Lemma 3 (first): Let p, q, r be patterns. 

(a) If p = qr then 

1. if λ does not match q then first_p = first_q 

2. if λ matches q then first_p = first_q ∪ first_r 

(b) If p = (q|r) then first_p = first_q ∪ first_r. 

(c) If p = (q)* then first_p = first_q. 

Proof: Immediate from the definitions of first and match. ∎ 

The parser looks at the first symbol of a string in order to decide how to begin 
matching the string to a pattern. The next lemma guarantees that the parser never looks at 
~ when deciding; i.e., that the first symbol is always some delimiter and not part of a 
subexpression. 

Lemma 4: If p is a pattern, first_p contains only tokens (not ~). 

Proof: By induction on the definition of a pattern. If p = λ, "d", or "d" ~ then 
first_p = ∅, {d}, or {d} respectively. If p = qr, (q|r), or (q)*, then by 
Lemma 3 and induction first_p contains only tokens. ∎ 

We turn now to properties of the set cont. We begin with its value relative to the 
null string. 

Lemma 5: If p is a pattern and λ matches p then cont_p(λ) = first_p. 

Proof: From definitions, 

cont_p(λ) = ∪{first(β) : λβ matches p, β≠λ} = ∪{first(β) : β matches p, β≠λ} = first_p. ∎ 

This result has a strong implication for star closure; restriction R3 prevents the use of star 
closure on nontrivial patterns matched by the null string. 

Lemma 6: If p is a pattern and cont_p ∩ first_p = ∅ and λ matches p then only λ matches p. 

Proof: Assume λ matches p. By Lemma 5 we have first_p = cont_p(λ) ⊆ cont_p. Since we assumed 
cont_p ∩ first_p = ∅, it must be that first_p = ∅, implying that no string other than λ can 
match p. ∎ 

The next lemma is a preliminary result to be used in the proofs of Lemmas 8 and 10. 
It deals with the way in which a string can match a concatenated pattern. 

Definition: The string w is a prefix of string w' iff w' = wα for some string α; if α ≠ λ 
then w is said to be a nontrivial prefix of w'. 

Lemma 7 (Ambiguity): Let q and r be patterns. If 

(1) cont_q ∩ first_r = ∅, 

(2) w = w_1 w_2 where w_1 matches q, w_2 matches r, and w_2 ≠ λ, 

(3) w' = w_1' w_2' where w_1' matches q and w_2' matches r, and 

(4) w is a prefix of w', 

then w_1 = w_1'. 

Proof: Since w is a prefix of w', exactly one of the following cases must hold: (i) w_1 is a 
nontrivial prefix of w_1', (ii) w_1' is a nontrivial prefix of w_1, or (iii) w_1 = w_1'. We will 
show that (i) and (ii) do not hold. 

(i) w_1' = w_1 α for some α ≠ λ. By definition, first(α) ∈ cont_q(w_1) ⊆ cont_q. Since 
w_2 ≠ λ, we also have first(α) = first(w_2) ∈ first_r, violating condition (1). 
(ii) w_1 = w_1' α for some α ≠ λ. It cannot be the case that w_2' = λ because w = w_1 w_2 is 
a prefix of w'. By symmetry this reduces to case (i). ∎ 

It is a corollary of Lemma 7 that when a string w matches p = qr, it matches in only one 
way. Applying Lemma 7 inductively, we get the same implication for star closure. 

Lemmas 8, 9, and 10 describe the contents of the set cont relative to concatenation, 
alternation, and star closure. Since these are the essential lemmas for the actual program 
proof, they are stated in terms of specific strings; i.e., they describe cont_p(w) rather than 
cont_p. The lemmas are intended to directly imply the correctness of the pattern matching 
part of the parsing algorithm. For example, Lemma 8 guarantees that concatenated patterns 
may be dealt with locally, one at a time. 
Lemma 8 (cont): Let p = qr and w = w_1 w_2 with w_1 matching q and w_2 matching r, then 

(a) if w_2 ≠ λ then cont_p(w) = cont_r(w_2) 

(b) if w_2 = λ then cont_p(w) = cont_r(w_2) ∪ cont_q(w_1). 

Proof: (a) 

⊇: We have cont_r(w_2) = ∪{first(β) : w_2β matches r, β≠λ} ⊆ ∪{first(β) : wβ matches p, β≠λ} 
= cont_p(w) by definition and since w_2β matching r implies that w_1w_2β matches p. 

⊆: By definition cont_p(w) = ∪{first(β) : wβ matches p, β≠λ}. If wβ = w_1w_2β matches p, then let 
wβ = w' = w_1'w_2' where w_1' matches q and w_2' matches r. Since w is a prefix of w' and w_2 ≠ λ we 
have w_1 = w_1' by Lemma 7. Then w_2' = w_2β, so first(β) ∈ cont_r(w_2). 

(b) 

⊇: As in part (a) cont_r(w_2) ⊆ cont_p(w). In addition, 
cont_q(w_1) = ∪{first(β) : w_1β matches q, β≠λ} ⊆ ∪{first(β) : wβ matches p, β≠λ} = cont_p(w) since 
w_1β matching q implies that wβ = w_1λβ matches p. 

⊆: By definition cont_p(w) = ∪{first(β) : wβ matches p, β≠λ}. If wβ = w_1w_2β = w_1β matches p then let 
w_1β = w' = w_1'w_2' where w_1' matches q and w_2' matches r. We consider the three cases of the 
relationship between w_1 and w_1'. 

(i) If w_1 is a nontrivial prefix of w_1', then first(β) ∈ cont_q(w_1). 

(ii) If w_1' is a nontrivial prefix of w_1, then w_2' ≠ λ. But then first(w_2') ∈ cont_q(w_1'). 
Since first(w_2') ∈ first_r this violates R1. 

(iii) If w_1 = w_1', then β = w_2' and first(β) ∈ first_r. By Lemma 5 
first_r = cont_r(λ) = cont_r(w_2). ∎ 

Lemma 9 (cont): Let p, q, r be patterns and p = (q|r). 

(a) If w = λ then cont_p(w) = first_q ∪ first_r. 

(b) If w ≠ λ and w matches q then cont_p(w) = cont_q(w). 

(c) If w ≠ λ and w matches r then cont_p(w) = cont_r(w). 

Proof: (a) By Lemma 5 cont_p(λ) = first_p, and by Lemma 3b first_p = first_q ∪ first_r. 

(b) Claim: wβ matches p iff wβ matches q. Clearly wβ matching q implies wβ matches p. 
Conversely, if wβ matches p then either wβ matches q or wβ matches r. If wβ matches r then 
first(wβ) ∈ first_r. But since w ≠ λ, first(wβ) = first(w) ∈ first_q, violating R2, so wβ 
matches q. We conclude that 
cont_p(w) = ∪{first(β) : wβ matches p, β≠λ} = ∪{first(β) : wβ matches q, β≠λ} = cont_q(w). 

(c) Similar to part (b). ∎ 

Lemma 10 (cont): Let p, q be patterns and p = (q)*.

(a) If ω = λ then cont_p(ω) = first_q.

(b) If λ ≠ ω ≺ p where ω = ω_1...ω_n for n ≥ 1 and ω_i ≺ q for 1 ≤ i ≤ n, then

cont_p(ω) = cont_q(ω_n) ∪ first_q.

Proof:

(a) If ω = λ then cont_p(λ) = first_p = first_q by Lemmas 5 and 3c.

(b) By Lemma 6 we need only consider 2 cases: either q is matched only by λ, or q is not
matched by λ at all. Since ω ≠ λ we assume the second case, where λ ⊀ q.

⊇ We have first_q = ∪_{β ≺ q, β ≠ λ} {first(β)} ⊆ ∪_{ωβ ≺ p, β ≠ λ} {first(β)} = cont_p(ω), since
β ≺ q and ω ≺ p implies that ωβ ≺ p. We have also
cont_q(ω_n) = ∪_{ω_nβ ≺ q, β ≠ λ} {first(β)} ⊆ ∪_{ωβ ≺ p, β ≠ λ} {first(β)} = cont_p(ω), since
ω_1...ω_{n-1} ≺ p and ω_nβ ≺ q implies that ωβ ≺ p.

⊆ By induction on n.

n=1: By definition cont_p(ω_1) = ∪_{ω_1β ≺ p, β ≠ λ} {first(β)}. Let ω_1β = ω' = ω_1'...ω_m'
where ω_i' ≺ q for 1 ≤ i ≤ m. Since ω' ≠ λ, m ≥ 1. We consider the three possible relationships
between ω_1 and ω_1'.

(i) If ω_1' is a nontrivial prefix of ω_1 then m ≥ 2, and since ω_2' ≠ λ (recalling that λ ⊀ q), we
have first(ω_2') ∈ cont_q(ω_1') ⊆ cont_q. But also first(ω_2') ∈ first_q, violating R3.
Contradiction.

(ii) If ω_1 = ω_1' then, since β ≠ λ, we have m ≥ 2. Again ω_2' ≠ λ so
first(β) = first(ω_2') ∈ first_q.

(iii) If ω_1 is a nontrivial prefix of ω_1' then first(β) ∈ cont_q(ω_1).

n>1: Assume the result for n-1. If ω = ω_1...ω_n then by definition
cont_p(ω) = ∪_{ωβ ≺ p, β ≠ λ} {first(β)}. If ωβ ≺ p then let ωβ = ω' = ω_1'...ω_m' where ω_i' ≺ q
for 1 ≤ i ≤ m. Apply Lemma 7 as follows. We have ω is a prefix of ω'. Decompose ω as
ω = ω_1 ω_2...ω_n where ω_1 ≺ q and ω_2...ω_n ≺ p. Likewise ω' = ω_1' ω_2'...ω_m' where
ω_1' ≺ q and ω_2'...ω_m' ≺ p. Since first_p = first_q by Lemma 3c, we have
cont_q ∩ first_p = ∅ (using R3). So by Lemma 7, ω_1 = ω_1'. We now have
ω_2...ω_nβ = ω_2'...ω_m' ≺ p, so first(β) ∈ cont_p(ω_2...ω_n). By induction we have
cont_p(ω_2...ω_n) = cont_q(ω_n) ∪ first_q. ∎

Our final lemma about the set cont is the counterpart of Lemma 4 for the set first.

Lemma 11: If p is a pattern, cont_p contains only tokens (not ~).

Proof: By induction on the definition of patterns. If p = λ, "d", or "d"~ then
cont_p = ∅. If p = qr then by Lemma 8 and induction. If p = (q|r) then by Lemma
9, induction, and Lemma 4. If p = (q)* then by Lemma 10, induction, and Lemma 4. ∎

The final two lemmas are proofs of the pattern utility programs LAMBDA-P and 
FIRST. Their correctness will follow almost directly from Lemmas 2 and 3. 

Lemma 12 (LAMBDA-P): Let p be a pattern. Then (LAMBDA-P p) = T iff λ ≺ p.

Proof: By induction on patterns. The program deals with five exclusive cases. When p = λ
the answer is T. When p = "d" or "d"~, then the answer is NIL. When p = qr the
answer is (AND (LAMBDA-P q) (LAMBDA-P r)), by induction and Lemma 2. Similarly,
when p = (q|r) the answer is (OR (LAMBDA-P q) (LAMBDA-P r)), and when p = (q)*
the answer is T. ∎

Lemma 13 (FIRST): Let p be a pattern. Then (FIRST p) = a list containing the
symbols of first_p.

Proof: By induction on patterns, with the same five exclusive cases as the previous lemma; we
now use Lemma 3 inductively. When p = λ then NIL. When p = "d" or "d"~, then (d).
When p = qr then (APPEND (FIRST q) (COND ((LAMBDA-P q) (FIRST r)))), where
λ ≺ q is determined by Lemma 12. When p = (q|r) then
(APPEND (FIRST q) (FIRST r)). Finally, when p = (q)* then simply (FIRST q). ∎
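The five cases of Lemmas 12 and 13 can be transcribed almost line for line. The following is a Python sketch rather than the LISP programs LAMBDA-P and FIRST themselves; the nested-tuple encoding of patterns and the names lambda_p and first are assumptions made for the illustration.

```python
# Assumed encoding (not from the thesis): patterns are nested tuples
#   ("lam",)  ("tok", d)  ("arg", d)  ("cat", q, r)  ("alt", q, r)  ("star", q)
# where ("tok", d) stands for "d" and ("arg", d) for "d"~.

def lambda_p(p):
    """True iff the null string matches p (the five cases of Lemma 12)."""
    kind = p[0]
    if kind == "lam":
        return True
    if kind in ("tok", "arg"):
        return False
    if kind == "cat":
        return lambda_p(p[1]) and lambda_p(p[2])
    if kind == "alt":
        return lambda_p(p[1]) or lambda_p(p[2])
    if kind == "star":
        return True
    raise ValueError(f"unknown pattern kind: {kind}")

def first(p):
    """List of the delimiters that can begin a string matched by p (Lemma 13)."""
    kind = p[0]
    if kind == "lam":
        return []
    if kind in ("tok", "arg"):
        return [p[1]]
    if kind == "cat":
        # first_r contributes only when q can match the null string
        return first(p[1]) + (first(p[2]) if lambda_p(p[1]) else [])
    if kind == "alt":
        return first(p[1]) + first(p[2])
    if kind == "star":
        return first(p[1])
    raise ValueError(f"unknown pattern kind: {kind}")

# A hypothetical argument-list pattern: (","~)* followed by ")"
args = ("cat", ("star", ("arg", ",")), ("tok", ")"))
print(lambda_p(args), first(args))
```

Because the starred part may match nothing, the closing delimiter also belongs to the first set of the whole pattern.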
V.C PARSE Theorem I

We present now the proof of the first PARSE theorem stated in Section V.A:

    ∀e ∈ E'_D (P_D(U_D(e)) = e)

where E'_D is the set of expression trees defined formally in Section IV.B, U_D is the writing
function of Section IV.B, and the parsing function P_D corresponds to the LISP program
PARSE presented in Section IV.C. The program PARSE accepts as input a list of tokens; its
value, if it halts error-free, is the LISP representation of an expression tree, as defined in
Section IV.C. The final token in any input string to PARSE is the special termination
symbol ⊣; the left binding power of this symbol is assumed to be -1, lower than any non-negative
left binding power used. In terms of the program the theorem is

PARSE Theorem I: If e ∈ E'_D and S = U_D(e) then

    (PARSE -1 (S ⊣)) = (repr[e] ⊣)

Its inductive proof requires a restatement in the following more general form:

Theorem 1.9: If for some e and rbp we are given

    C1. e ∈ E'_D and S = t_1...t_k = U_D(e) for k ≥ 1.

    C2. r-index[e] ≥ lbp[t_{k+1}].

    C3. t_{k+1} ∉ c-set[e].

    C4. rbp < l-index[e].

    C5. rbp ≥ lbp[t_{k+1}].

then

    (PARSE rbp (S t_{k+1} ...)) = (repr[e] t_{k+1} ...).

PARSE Theorem I is a special case of Theorem 1.9 by the following argument. C1 is the
given, letting S = t_1...t_k. The symbol t_{k+1} is ⊣ which has a left binding power of -1;
from Lemma 1 we know that r-index[e] ≥ 0, so C2 is satisfied. For condition C3 we
observe that since ⊣ is not in the defined language, it cannot be in c-set[e]. As above, we
know that l-index[e] ≥ 0, so C4 is true. Finally, we know that rbp = lbp[⊣] = -1,
satisfying C5.

Outline of Proof 

Theorem 1.9 is the last in a sequence of nine subsidiary theorems, which correspond 
roughly to the subroutines of the program PARSE. Theorem 1.1 (FIND) covers the correct 
parsing of the annotation part of an expression. Theorems 1.2, 1.3, and 1.4 (NILFIX, 
PREFIX, and NUL-TYPE) deal with NUL-TYPE operators, and Theorems 1.5, 1.6, and 1.7 
(POSTFIX, INFIX, and LEF-TYPE) similarly treat LEF-TYPE operators. Theorems 1.8 and 1.9 
(PARSEa and PARSEb) state the top level behavior of the PARSE and ASSOC programs, the 
essential part of the parsing algorithm; Theorem 1.8 corresponds to the recursive parsing of 
left arguments and Theorem 1.9 to right arguments. Each theorem guarantees that if its 
arguments meet certain conditions, then the result of the corresponding subroutine has the 
desired property, i.e., that the subroutine operates correctly. With the exception of the 
language definition attached to property lists, as described in Section IV.C, each subroutine 
uses only values given as explicit arguments. No side-effects need be mentioned since the 
given implementation of PARSE contains only local variables. 

The theorems are proven using simultaneous induction over the set E'_D of expression
trees. At each level of induction, they may be proven sequentially according to their 
dependence by subroutine calls, as diagrammed in the partial ordering of Figure 2. In this 
figure the proof of the upper theorem of a linked pair depends on the lower theorem; the 
inductive use of Theorems 1.8 and 1.9 is indicated at the bottom of the graph. For instance, 
Theorem 1.4 depends on Theorems 1.2 and 1.3, which in turn depend on 1.1. In addition, 1.3 
and 1.1 depend inductively on 1.9. 

We use simple induction in this theorem to correspond exactly to the definition of the
domain E'_D; i.e. using a basis and an induction step. This form of definition was chosen for
clarity and precision. The nature of the domain would, however, allow a proof by strong
induction (without a basis step), since the theorem only requires induction in the cases when
there exist non-null subtrees. Rather than redefine the domain or create unnecessary
confusion, strong induction is not used.

[Figure 2: dependency graph. Recoverable node labels, from top to bottom: 1.9 PARSEb;
1.8 PARSEa; 1.4 NUL-TYPE and 1.7 LEF-TYPE; 1.3 PREFIX, 1.2 NILFIX, 1.5 POSTFIX, and
1.6 INFIX; and, at the bottom, the inductive uses of 1.8 PARSEa and 1.9 PARSEb.]

Figure 2. Interdependence of Theorems 1.1-1.9

We now want to examine the five conditions we will impose on our input string in 
order to guarantee that PARSE returns the correct value. When viewed relative to a call to 
PARSE, they have the following interpretations. Condition 1 requires that the input string 
begin with a sentence S of the language. Condition 2 ensures that no subexpression on the
right end of S becomes associated as a left argument to t_{k+1}, if t_{k+1} is an operator. If t_{k+1}
is a delimiter then condition 3 prevents its inclusion in any annotation within S. The right
binding power of the call to PARSE must be low enough for the entire expression to be
returned, condition 4, but not so low that the expression is given as a left argument to t_{k+1},
condition 5.
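The roles of conditions 4 and 5 are easiest to see in the PARSE/ASSOC loop itself. The following Python sketch is a simplified illustration, not the LISP program of Section IV.C: it omits annotation patterns (and hence FIND), and the operator tables NUL and LEF, their binding powers, and the spelling "-|" for the termination symbol are all assumptions of the sketch.

```python
END = "-|"   # stands in for the termination symbol, whose lbp is -1

# Hypothetical definitions for this sketch.
NUL = {"-": 30}                                   # prefix operators: token -> rbp
LEF = {"+": ("infix", 10, 10),                    # token -> (kind, lbp, rbp)
       "*": ("infix", 20, 20),
       "!": ("postfix", 40, 40)}

def lbp(tok):
    """Left binding power; undefined tokens get 0, like delimiters."""
    if tok == END:
        return -1
    return LEF[tok][1] if tok in LEF else 0

def parse(rbp, tokens):
    """Return (tree, rest): one NUL-TYPE step followed by ASSOC."""
    return assoc(rbp, nul_type(tokens))

def nul_type(tokens):
    op, rest = tokens[0], tokens[1:]
    if op in NUL:                                 # prefix: parse a right argument
        right, rest = parse(NUL[op], rest)
        return [op, ["right", right]], rest
    return [op], rest                             # default: nilfix, no arguments

def assoc(rbp, state):
    left, rest = state
    if rbp < lbp(rest[0]):                        # the LESSP test
        return assoc(rbp, lef_type(left, rest))   # left becomes a left argument
    return state                                  # otherwise stop and return

def lef_type(left, tokens):
    op, rest = tokens[0], tokens[1:]
    kind, _, op_rbp = LEF[op]
    if kind == "postfix":
        return [op, ["left", left]], rest
    right, rest = parse(op_rbp, rest)
    return [op, ["left", left], ["right", right]], rest

tree, rest = parse(-1, ["a", "+", "b", "*", "c", END])
print(tree, rest)
```

In the top-level call rbp is -1, which is below every operator's left binding power, so the whole expression is absorbed (condition 4); it is not below the left binding power of the termination symbol, so the loop stops there and the expression is returned (condition 5).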

Statement of Theorems 1.1 through 1.9 

We precede our list of nine theorems by a formal statement of the conditions C1
through C5, on which they depend. For convenience in the proofs, the first three have
been broken down into their definitional components: conditions C1a through C1f are
equivalent to C1, C2a and C2b are equivalent to C2, and C3a and C3b are equivalent to C3.

Conditions:

C1.  e ∈ E'_D and S = α op β ω = U_D(e) = t_1...t_k

C1a. α = U_D(e_left) if e_left exists (λ otherwise), β = U_D(e_right) if e_right exists (λ otherwise),
     and ω = d_1γ_1...d_nγ_n for n ≥ 0, where γ_i = U_D(e_i) for 1 ≤ i ≤ n when e_i is non-null
     (λ otherwise), and e_left, e_right, e_1, ..., e_n ∈ E'_D when they exist and are non-null.

C1b. r-index[e_left] ≥ lbp[op] if e_left exists.

C1c. rbp[op] < l-index[e_right] if e_right exists.

C1d. rbp[op] < l-index[e_i] for 1 ≤ i ≤ n, when e_i is non-null.

C1e. d_1 ∉ c-set[e_right] if e_right and d_1 exist.

C1f. d_i ∉ c-set[e_{i-1}] for 1 < i ≤ n when e_{i-1} is non-null.

C2.  r-index[e] ≥ lbp[t_{k+1}].

C2a. rbp[op] ≥ lbp[t_{k+1}] if e_n exists and is non-null.

C2b. r-index[e_n] ≥ lbp[t_{k+1}] if e_n exists and is non-null.

C3.  t_{k+1} ∉ c-set[e].

C3a. t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n).

C3b. t_{k+1} ∉ c-set[e_n] if e_n exists.

C4.  rbp < l-index[e].

C5.  rbp ≥ lbp[t_{k+1}].

Notation: 

(i) When writing LISP expressions, upper case words and parentheses will always
    refer to LISP code; when describing known values within LISP expressions,
    lower case and square brackets will be used. Specifically, the meta-variable op
    represents the token defined for op.
(ii) The representation of the annotation part produced by FIND is
    ((d_1 repr[e_1]) ... (d_n repr[e_n])) and will be written repr[e_1, ..., e_n].
(iii) Since the representations of patterns are not manipulated in this program, we
    will abbreviate repr[p[op]] to simply p[op].
(iv) In proofs, we will use the names C1, C2, etc. to refer to the given conditions for
    the theorem being proved; C1', C2', etc. will refer to the antecedents to be
    satisfied when using Theorems 1.8 and 1.9 inductively.

We now state the nine theorems in full. 

Theorem 1.1 (FIND): Given C1-C3 for some e. Then

    (FIND rbp[op] (nil ω t_{k+1} ...) p[op]) = (repr[e_1, ..., e_n] t_{k+1} ...)

Theorem 1.2 (NILFIX): Given C1-C3 for some e. If op is defined NILFIX,

    (NILFIX op (ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Theorem 1.3 (PREFIX): Given C1-C3 for some e. If op is defined PREFIX,

    (PREFIX op (β ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Theorem 1.4 (NUL-TYPE): Given C1-C3 for some e. If op is defined NILFIX or
PREFIX, then

    (NUL-TYPE (op β ω t_{k+1} ...)) = (repr[e] t_{k+1} ...)

Theorem 1.5 (POSTFIX): Given C1-C3 for some e. If op is defined POSTFIX,

    (POSTFIX repr[e_left] op (ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Theorem 1.6 (INFIX): Given C1-C3 for some e. If op is defined INFIX,

    (INFIX repr[e_left] op (β ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Theorem 1.7 (LEF-TYPE): Given C1-C3 for some e. If op is defined POSTFIX or
INFIX, then

    (LEF-TYPE (repr[e_left] op β ω t_{k+1} ...)) = (repr[e] t_{k+1} ...)

Theorem 1.8 (PARSEa): Given C1-C4 for some e and rbp. Then

    (PARSE rbp (S t_{k+1} ...)) = (ASSOC rbp (repr[e] t_{k+1} ...))

Theorem 1.9 (PARSEb): Given C1-C5 for some e and rbp. Then

    (PARSE rbp (S t_{k+1} ...)) = (repr[e] t_{k+1} ...)

Proof of Theorems 1.1 through 1.9, Basis Step 

For the basis step we assume that the tree e ∈ E'_D is a single node whose label we
denote op. Then op is defined NILFIX, λ ≺ p[op], and t_1 = S = U_D(e) = op (n=0,
k=1), so the annotation part is ω = λ. Note that since op is defined NILFIX, Theorems 1.3,
1.5, 1.6, and 1.7 are not applicable.

Theorem 1.1 (FIND): If ω = λ matches p[op] and if t_2 ∉ cont_{p[op]}(λ), then

    (FIND rbp[op] (nil t_2 ...) p[op]) = (nil t_2 ...)

Proof: The proof is by induction over the definition of the pattern p[op]; the six possible
cases are handled by the six conditional clauses in the program.

Case 1. If p[op] = λ then (FIND rbp[op] (nil t_2 ...) p[op]) = (nil t_2 ...)
immediately.

Cases 2,3. Impossible since if p[op] = "d" or "d"~ it could not be that λ ≺ p[op].

Case 4. If p[op] = qr, then (FIND rbp[op] (nil t_2 ...) p[op])
    = (FIND rbp[op] (FIND rbp[op] (nil t_2 ...) q) r) by the program. We now use
induction on the expression (FIND rbp[op] (nil t_2 ...) q). Since λ ≺ p[op] and
t_2 ∉ cont_{p[op]}(λ), we know by Lemma 2a that λ ≺ q and by Lemma 8b that
t_2 ∉ cont_q(λ), so this expression is (nil t_2 ...) and we have
    = (FIND rbp[op] (nil t_2 ...) r). As above we have λ ≺ r and t_2 ∉ cont_r(λ), so by
another induction we have
    = (nil t_2 ...).

Case 5. If p[op] = (q|r), then (FIND rbp[op] (nil t_2 ...) p[op]) is a conditional
with three clauses. The first test is (MEMBER t_2 first_q), using Lemma 13 for the
correctness of FIRST. Since t_2 ∉ cont_{p[op]}(λ), we know that t_2 ∉ first_q by Lemma
9a, and this test will fail. Similarly the second test (MEMBER t_2 first_r) will fail. The
third test (LAMBDA-P p[op]) will be true by Lemma 12 and our assumption, so the
result is (nil t_2 ...).

Case 6. If p[op] = (q)*, then (FIND rbp[op] (nil t_2 ...) p[op]) is a conditional
with two clauses. The first test is (MEMBER t_2 first_q). Since t_2 ∉ cont_{p[op]}(λ) we have
by Lemma 10 that t_2 ∉ first_q, and the test fails. The second clause then always returns
(nil t_2 ...). ∎

Theorem 1.2 (NILFIX): Given C1-C3 for some e. If op is defined NILFIX,

    (NILFIX op (t_2 ...) rbp[op] p[op]) = (repr[e] t_2 ...).

Proof: From the program we have the expression
    = (CONS (APPEND (op)
                    (CAR (FIND rbp[op] (nil t_2 ...) p[op])))
            (CDR (FIND rbp[op] (nil t_2 ...) p[op])))
By Theorem 1.1 we know the call to FIND returns (nil t_2 ...), so we have
    = (CONS (APPEND (op) nil) (t_2 ...)), which is
    = (repr[e] t_2 ...) by the definition of representation. ∎

Theorem 1.4 (NUL-TYPE): Given C1-C3 for some e. If op is defined NILFIX or
PREFIX, then

    (NUL-TYPE (op t_2 ...)) = (repr[e] t_2 ...)

Proof: By the program and Axiom 1 covering definitions, we have
    = (NILFIX op (t_2 ...) rbp[op] p[op]), which by Theorem 1.2 is
    = (repr[e] t_2 ...). ∎

Theorem 1.8 (PARSEa): Given C1-C4 for some e and rbp. Then

    (PARSE rbp (S t_2 ...)) = (ASSOC rbp (repr[e] t_2 ...))

Proof: By the program we have
    = (ASSOC rbp (NUL-TYPE (S t_2 ...))), which by Theorem 1.4 is
    = (ASSOC rbp (repr[e] t_2 ...)). ∎

Theorem 1.9 (PARSEb): Given C1-C5 for some e and rbp. Then

    (PARSE rbp (S t_2 ...)) = (repr[e] t_2 ...)

Proof: By Theorem 1.8 we have
    = (ASSOC rbp (repr[e] t_2 ...)), which makes the test (LESSP rbp lbp[t_2]). By C5
this is false, so the value is
    = (repr[e] t_2 ...). ∎

Proof of Theorems 1.1 through 1.9, Induction Step 

We assume that the tree e ∈ E'_D is a node, whose label we denote op, with subtrees.
We assume Theorems 1.8 and 1.9 inductively for any of these subtrees.

Theorem 1.1 (FIND): If ω = d_1γ_1...d_nγ_n matches p[op] for n ≥ 0 where γ_i = U_D(e_i) and
e_i ∈ E'_D for 1 ≤ i ≤ n, and if C1d, C1f, C2b, C3a, and C3b hold for ω, then

    (FIND rbp[op] (nil ω t_{k+1} ...) p[op]) = (repr[e_1, ..., e_n] t_{k+1} ...)

Proof: Since FIND is called recursively, in general there will be some annotation fragment
ν, not necessarily nil, which has already been parsed at some previous stage of the
execution. Thus, we will actually prove a more general assertion than that in the
problem statement itself:

    (FIND rbp[op] (ν ω t_{k+1} ...) p[op]) = (ν∘repr[e_1, ..., e_n] t_{k+1} ...),

where, for convenience, the result of appending two lists a and b is written a∘b. As in
the basis step, the proof is by induction over the definition of the pattern p[op]; the six
possible cases are handled by the six separate clauses of the conditional statement.

Case 1. If p[op] = λ then (FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (ν ω t_{k+1} ...). Since ω = λ, then n=0 and k=0, so
    = (ν t_1 ...). But repr[e_1, ..., e_n] = nil so we have
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...).

Case 2. If p[op] = "d" then we must have ω = d = t_1. By the program then, since d
matches t_1, (FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (CONS (APPEND ν ((d))) (t_2 ...))
    = (ν∘((d)) t_2 ...) which is
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...).

Case 3. If p[op] = "d"~ then we must have ω = d_1γ_1 where γ_1 = U_D(e_1). By the
program we have (FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (CONS (APPEND ν (LIST (LIST d (CAR (PARSE rbp[op] (γ_1 t_{k+1} ...))))))
            (CDR (PARSE rbp[op] (γ_1 t_{k+1} ...))))
We apply Theorem 1.9 inductively to (PARSE rbp[op] (γ_1 t_{k+1} ...)). C1' is satisfied
by our assumption about ω. Since n=1, C2' and C3' are satisfied by C2b and C3b
respectively. Finally, C4' and C5' are satisfied directly by C1d and C2a. We have then
(PARSE rbp[op] (γ_1 t_{k+1} ...)) = (repr[e_1] t_{k+1} ...), so our result is
    = (CONS (APPEND ν ((d repr[e_1]))) (t_{k+1} ...))
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...).

Case 4. If p[op] = qr then we must have ω = ω_1ω_2 where
ω_1 = t_1...t_j = d_1γ_1...d_mγ_m matches q, with m ≥ 0 and j ≥ 0, and
ω_2 = t_{j+1}...t_k = d_{m+1}γ_{m+1}...d_nγ_n matches r, with n ≥ m and k ≥ j.
By the program we have (FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (FIND rbp[op] (FIND rbp[op] (ν ω t_{k+1} ...) q) r). We first apply our inductive
assertion to the nested expression which is equivalent to
(FIND rbp[op] (ν ω_1 t_{j+1} ...) q). Conditions C1d' and C1f' about the internal
properties of ω_1 follow directly from C1d and C1f respectively. C2b' through C3a' deal
with the token t_{j+1} so we must deal separately with the cases where ω_2 ≠ λ and ω_2 = λ. If
ω_2 ≠ λ, then t_{j+1} must be a delimiter by Lemma 11, so lbp[t_{j+1}] = 0, satisfying C2b' and
C2a'. In this case t_{j+1} is also the delimiter d_{m+1}, so C3b' is satisfied by C1f. Finally, we
know that t_{j+1} ∈ first_r and by R1 cont_q ∩ first_r = ∅, so C3a' is satisfied. If ω_2 = λ,
then t_{j+1} = t_{k+1}, and m=n. In this case, since e_m = e_n, C2b', C2a', and C3b' are satisfied
directly by C2b, C2a, and C3b. Finally, since t_{j+1} = t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n) by
C3a, Lemma 8b says that t_{j+1} ∉ cont_q(e_1, ..., e_n), satisfying C3a'. We have then by
induction that this nested expression is (ν∘repr[e_1, ..., e_m] t_{j+1} ...), so we have
    = (FIND rbp[op] (ν∘repr[e_1, ..., e_m] ω_2 t_{k+1} ...) r). We again use the assertion
inductively. As before C1d' and C1f' are directly satisfied, but since the last part of ω_2 is
also the last part of ω, conditions C2b', C2a', and C3b' are also directly satisfied. Since
t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n), Lemma 8 says that t_{k+1} ∉ cont_r(e_{m+1}, ..., e_n), satisfying
C3a'. We have finally,
    = (ν∘repr[e_1, ..., e_m]∘repr[e_{m+1}, ..., e_n] t_{k+1} ...)
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...)

Case 5. If p[op] = (q|r) then by the program we have
(FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (COND ((MEMBER t_1 first_q) (FIND rbp[op] (ν ω t_{k+1} ...) q))
            ((MEMBER t_1 first_r) (FIND rbp[op] (ν ω t_{k+1} ...) r))
            ((LAMBDA-P p[op]) (ν ω t_{k+1} ...)))
It must be that either ω ≠ λ matches q, ω ≠ λ matches r, or ω = λ. In the first case we have
t_1 ∈ first_q, so the first test is true, and we get the value of
(FIND rbp[op] (ν ω t_{k+1} ...) q). All conditions for induction are satisfied
automatically except C3a'. From C3a we have that t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n), but
then Lemma 9b tells us that t_{k+1} ∉ cont_q(e_1, ..., e_n). By induction then, this returns
the correct value. In the second case, the first test will fail because t_1 ∈ first_r, and R2
says that first_q ∩ first_r = ∅. The second will be true, and as above the correct result
will be returned. In the final case where ω = λ, the first two tests must fail for the
following reason: C3a says that t_{k+1} = t_1 ∉ cont_{p[op]}(e_1, ..., e_n) = cont_{p[op]}(λ), so
we know by Lemma 9a that t_1 can be in neither first_q nor first_r. By Lemma 12
(LAMBDA-P p[op]) will be true, so (ν ω t_{k+1} ...) is returned; since
repr[e_1, ..., e_n] = nil, we here too get
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...).

Case 6. If p[op] = (q)* then by the program (FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (COND ((MEMBER t_1 first_q)
             (FIND rbp[op] (FIND rbp[op] (ν ω t_{k+1} ...) q) p[op]))
            (T (ν ω t_{k+1} ...)))
By definition either ω = λ or ω = ω_1...ω_r for r > 0, where ω_i ≺ q for 1 ≤ i ≤ r. If ω = λ then
the first test must fail for the following reason: from C3a we have
t_1 = t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n) = cont_{p[op]}(λ). But by Lemma 10a we know then
t_1 ∉ first_q, so the correct result is returned. For r > 0 we prove the assertion by
induction on r, the number of segments of ω.

r = 1. Then we have ω = ω_1 = d_1γ_1...d_nγ_n matches q. By the program we have
(FIND rbp[op] (ν ω_1 t_{k+1} ...) p[op])
    = (COND ((MEMBER t_1 first_q)
             (FIND rbp[op] (FIND rbp[op] (ν ω_1 t_{k+1} ...) q) p[op]))
            (T (ν ω_1 t_{k+1} ...)))
By our assumption the test t_1 ∈ first_q will be true. We first apply our induction
hypothesis on patterns to the nested expression (FIND rbp[op] (ν ω_1 t_{k+1} ...) q).
We know by assumption that ω_1 ≺ q. Conditions C1d' through C3b' are satisfied directly
by C1d through C3b respectively. From C3a we know that t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n).
By Lemma 10b we know that t_{k+1} ∉ cont_q(ω_r), where ω_r = ω_1, so C3a' is satisfied. We
then have
    = (FIND rbp[op] (ν∘repr[e_1, ..., e_n] t_{k+1} ...) p[op]). Now we know by C3a that
t_{k+1} ∉ cont_{p[op]}(e_1, ..., e_n) so we know by Lemma 10b that t_{k+1} ∉ first_q. The test is
false and the value of the program is
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...).

r > 1. Then we have ω = ω_1ω_2...ω_r where ω_1 = t_1...t_j = d_1γ_1...d_mγ_m matches q,
with m > 0 and j > 0, and ω_2...ω_r = t_{j+1}...t_k = d_{m+1}γ_{m+1}...d_nγ_n matches p[op], with
n > m and k > j. By the program we have (FIND rbp[op] (ν ω t_{k+1} ...) p[op])
    = (COND ((MEMBER t_1 first_q)
             (FIND rbp[op] (FIND rbp[op] (ν ω t_{k+1} ...) q) p[op]))
            (T (ν ω t_{k+1} ...)))
By our assumption the test t_1 ∈ first_q will be true. We first apply our induction
hypothesis on patterns to the string ω_1 in the nested expression
(FIND rbp[op] (ν ω_1 ω_2...ω_r t_{k+1} ...) q). Condition C1' is true by our assumption
about ω. C1d' and C1f' are true directly from C1d and C1f. By Lemma 5, t_{j+1}, the first
symbol in ω_2...ω_r, is a delimiter, and so lbp[t_{j+1}] = 0, satisfying C2b' and C2a'. We
also know that t_{j+1} = d_{m+1}, so C3b' is satisfied by C1f. Finally, since t_{j+1} ∈ first_q we know
by restriction R3 that t_{j+1} ∉ cont_q, satisfying C3a'. We have then the value
    = (FIND rbp[op] (ν∘repr[e_1, ..., e_m] ω_2...ω_r t_{k+1} ...) p[op]). We now apply the
induction on r to the string ω_2...ω_r. Conditions C1d' through C3a' are directly satisfied
by C1d through C3a respectively, so we have the value
    = (ν∘repr[e_1, ..., e_m]∘repr[e_{m+1}, ..., e_n] t_{k+1} ...) which is
    = (ν∘repr[e_1, ..., e_n] t_{k+1} ...). ∎
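The six clauses that this proof walks through can be sketched as follows. This is a Python illustration rather than the LISP program FIND of Section IV.C: the nested-tuple encoding of patterns, the (annotation, tokens) pair used for the program state, the raising of SyntaxError on a mismatch, and the stand-in argument parser parse_atom are all assumptions of the sketch.

```python
# Assumed encoding (not from the thesis): patterns are nested tuples
#   ("lam",)  ("tok", d)  ("arg", d)  ("cat", q, r)  ("alt", q, r)  ("star", q)

def lambda_p(p):
    kind = p[0]
    if kind == "cat":
        return lambda_p(p[1]) and lambda_p(p[2])
    if kind == "alt":
        return lambda_p(p[1]) or lambda_p(p[2])
    return kind in ("lam", "star")

def first(p):
    kind = p[0]
    if kind in ("tok", "arg"):
        return [p[1]]
    if kind == "cat":
        return first(p[1]) + (first(p[2]) if lambda_p(p[1]) else [])
    if kind == "alt":
        return first(p[1]) + first(p[2])
    if kind == "star":
        return first(p[1])
    return []

def find(rbp, state, p, parse):
    """state = (annotation parsed so far, remaining tokens); returns a new state."""
    v, tokens = state
    kind = p[0]
    if kind == "lam":                                   # Case 1
        return state
    if kind == "tok":                                   # Case 2: "d"
        if tokens and tokens[0] == p[1]:
            return v + [[p[1]]], tokens[1:]
        raise SyntaxError(f"expected {p[1]!r}")
    if kind == "arg":                                   # Case 3: "d"~
        if tokens and tokens[0] == p[1]:
            tree, rest = parse(rbp, tokens[1:])
            return v + [[p[1], tree]], rest
        raise SyntaxError(f"expected {p[1]!r}")
    if kind == "cat":                                   # Case 4: qr
        return find(rbp, find(rbp, state, p[1], parse), p[2], parse)
    if kind == "alt":                                   # Case 5: (q|r)
        if tokens and tokens[0] in first(p[1]):
            return find(rbp, state, p[1], parse)
        if tokens and tokens[0] in first(p[2]):
            return find(rbp, state, p[2], parse)
        if lambda_p(p):
            return state
        raise SyntaxError("no alternative matches")
    if kind == "star":                                  # Case 6: (q)*
        if tokens and tokens[0] in first(p[1]):
            return find(rbp, find(rbp, state, p[1], parse), p, parse)
        return state
    raise ValueError(f"unknown pattern kind: {kind}")

def parse_atom(rbp, tokens):
    """Stand-in for PARSE: treats the next token as a complete argument."""
    return [tokens[0]], tokens[1:]

# Hypothetical annotation pattern: "then"~ ("else"~ | lambda)
if_pattern = ("cat", ("arg", "then"), ("alt", ("arg", "else"), ("lam",)))
print(find(0, ([], ["then", "x", "else", "y", "-|"]), if_pattern, parse_atom))
```

Each clause decides what to do by looking only at the next token, which is why the proof needs conditions such as C3a to guarantee that the token following the annotation cannot be mistaken for a continuation of it.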

Theorem 1.2 (NILFIX): Given C1-C3 for some e. If op is defined NILFIX,

    (NILFIX op (ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Proof: From the program we have the expression
    = (CONS (APPEND (op)
                    (CAR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
            (CDR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
By Theorem 1.1 we know that the call to FIND returns (repr[e_1, ..., e_n] t_{k+1} ...), so
    = (CONS (APPEND (op) repr[e_1, ..., e_n]) (t_{k+1} ...))
    = (repr[e] t_{k+1} ...). ∎

Theorem 1.3 (PREFIX): Given C1-C3 for some e. If op is defined PREFIX,

    (PREFIX op (β ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Proof: From the program we have the expression
    = (CONS (APPEND (op)
                    (LIST (LIST 'RIGHT (CAR (PARSE rbp[op] (β ω t_{k+1} ...)))))
                    (CAR (FIND rbp[op]
                               (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1} ...))))
                               p[op])))
            (CDR (FIND rbp[op]
                       (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1} ...))))
                       p[op])))
Since β = U_D(e_right) and e_right ∈ E'_D, we know that if we can show our five conditions hold for
e_right then we can apply Theorem 1.9 inductively in order to obtain

    (PARSE rbp[op] (β ω t_{k+1} ...)) = (repr[e_right] ω t_{k+1} ...).

From C1a we obviously have C1' satisfied. C1c tells us that rbp[op] < l-index[e_right],
which immediately gives us C4'. For C2', C3', and C5' we must consider whether the
annotation part ω is the null string or not. If ω ≠ λ, then the first token of ω is the
delimiter d_1, so C3' is satisfied by C1e. Since lbp[d_1] = 0, conditions C2' and C5' are
also satisfied. If ω = λ, then n=0 and we immediately get C3' from C3b, C2' from C2b,
and C5' from C2a. Thus, the value of the expression is
    = (CONS (APPEND (op)
                    ((right repr[e_right]))
                    (CAR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
            (CDR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
By Theorem 1.1 we know that the value of the call to FIND is
(repr[e_1, ..., e_n] t_{k+1} ...), so the value of the expression is
    = (repr[e] t_{k+1} ...). ∎

Theorem 1.4 (NUL-TYPE): Given C1-C3 for some e. If op is defined NILFIX or
PREFIX, then

    (NUL-TYPE (op β ω t_{k+1} ...)) = (repr[e] t_{k+1} ...)

Proof: We consider the two possible cases. If op is defined NILFIX then
(GET op 'NUL-TYP) = NILFIX by Axiom 1, so we have by the program and Axiom 1
    = (NILFIX op (β ω t_{k+1} ...) rbp[op] p[op]), which by Theorem 1.2 is
    = (repr[e] t_{k+1} ...). Similarly, if op is defined PREFIX the correct value is returned by
the program, Axiom 2, and Theorem 1.3. If there is no NUL-TYPE definition for op, then
the value is
    = (NILFIX op (ω t_{k+1} ...) 0 λ), which is the default condition. In this case op is
assumed to be nilfix with no arguments and null pattern, so by Theorem 1.2 the correct
value is returned. ∎

Theorem 1.5 (POSTFIX): Given C1-C3 for some e. If op is defined POSTFIX,

    (POSTFIX repr[e_left] op (ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Proof: By the program we have the value
    = (CONS (APPEND (op)
                    ((left repr[e_left]))
                    (CAR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
            (CDR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
By Theorem 1.1, the call to FIND has the value (repr[e_1, ..., e_n] t_{k+1} ...), so we have
the expression
    = (CONS (APPEND (op)
                    ((left repr[e_left]))
                    repr[e_1, ..., e_n])
            (t_{k+1} ...))
    = (repr[e] t_{k+1} ...). ∎

Theorem 1.6 (INFIX): Given C1-C3 for some e. If op is defined INFIX,

    (INFIX repr[e_left] op (β ω t_{k+1} ...) rbp[op] p[op]) = (repr[e] t_{k+1} ...)

Proof: By the program we have the expression
    = (CONS (APPEND (op)
                    ((left repr[e_left]))
                    (LIST (LIST 'RIGHT (CAR (PARSE rbp[op] (β ω t_{k+1} ...)))))
                    (CAR (FIND rbp[op]
                               (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1} ...))))
                               p[op])))
            (CDR (FIND rbp[op]
                       (CONS NIL (CDR (PARSE rbp[op] (β ω t_{k+1} ...))))
                       p[op])))
We use Theorem 1.9 inductively on the expression (PARSE rbp[op] (β ω t_{k+1} ...)) in
exactly the same manner as in the proof of Theorem 1.3 (PREFIX), yielding the value
(repr[e_right] ω t_{k+1} ...). We have then the expression
    = (CONS (APPEND (op)
                    ((left repr[e_left]))
                    ((right repr[e_right]))
                    (CAR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
            (CDR (FIND rbp[op] (nil ω t_{k+1} ...) p[op])))
By Theorem 1.1 we know that the call to FIND returns the correct value, giving us
    = (CONS (APPEND (op)
                    ((left repr[e_left]))
                    ((right repr[e_right]))
                    repr[e_1, ..., e_n])
            (t_{k+1} ...))
    = (repr[e] t_{k+1} ...). ∎

Theorem 1.7 (LEF-TYPE): Given C1-C3 for some e. If op is defined POSTFIX or
INFIX, then

    (LEF-TYPE (repr[e_left] op β ω t_{k+1} ...)) = (repr[e] t_{k+1} ...)

Proof: We consider the two possible cases. If op is defined POSTFIX then
(GET op 'LEF-TYP) = POSTFIX by Axiom 3, so we have by the program and Axiom 3
    = (POSTFIX repr[e_left] op (ω t_{k+1} ...) rbp[op] p[op]), which by Theorem 1.5 is
    = (repr[e] t_{k+1} ...). Similarly, if op is defined INFIX the correct value is returned by
the program, Axiom 4, and Theorem 1.6. ∎

Theorem 1.8 (PARSEa): Given C1-C4 for some e and rbp. Then

    (PARSE rbp (S t_{k+1} ...)) = (ASSOC rbp (repr[e] t_{k+1} ...))

Proof: We consider the two possible cases: op is NUL-TYPE or LEF-TYPE.

Case 1. If op is defined NILFIX or PREFIX then we have α = λ in S = α op β ω, so t_1 = op.
By the program we have (PARSE rbp (op β ω t_{k+1} ...))
    = (ASSOC rbp (NUL-TYPE (op β ω t_{k+1} ...))), which by Theorem 1.4 is
    = (ASSOC rbp (repr[e] t_{k+1} ...)).

Case 2. If op is defined POSTFIX or INFIX then α ≠ λ. We apply Theorem 1.8 inductively to
the expression (PARSE rbp (α op β ω t_{k+1} ...)). From C1a we have α = U_D(e_left)
where e_left ∈ E'_D, satisfying C1'. From C1b we have r-index[e_left] ≥ lbp[op],
satisfying C2'. We do not allow LEF-TYPE operators to be used as delimiters, so since
only delimiters can occur in c-set[e_left], C3' is trivially satisfied. From C4 we have
rbp < l-index[e], and since l-index[e] = min(lbp[op], l-index[e_left]), we have
rbp < l-index[e_left], satisfying C4'. By induction, then, we have
(PARSE rbp (α op β ω t_{k+1} ...))
    = (ASSOC rbp (repr[e_left] op β ω t_{k+1} ...)). The value of the call to ASSOC is a
conditional whose first test is (LESSP rbp lbp[op]). By the same argument we used to
satisfy C4' above, we have rbp < lbp[op], so the test is true and the result is
    = (ASSOC rbp (LEF-TYPE (repr[e_left] op β ω t_{k+1} ...))). By Theorem 1.7 this is
    = (ASSOC rbp (repr[e] t_{k+1} ...)). ∎

Theorem 1.9 (PARSEb): Given C1-C5 for some e and rbp. Then

    (PARSE rbp (S t_{k+1} ...)) = (repr[e] t_{k+1} ...)

Proof: By Theorem 1.8 we have (PARSE rbp (S t_{k+1} ...))
    = (ASSOC rbp (repr[e] t_{k+1} ...)). The value of the call to ASSOC is a conditional
whose first test is (LESSP rbp lbp[t_{k+1}]). By C5 this is false, so the second clause
returns (repr[e] t_{k+1} ...). ∎

V.D PARSE Theorem II

We complete this chapter with the proof of the second PARSE theorem stated in 
Section V.A: 

    ∀S ∈ Σ* (P_D(S) halts error-free ⇒ S ∈ S_D)

where S is any string of tokens in Σ* and S_D is the defined language as described in Section
IV.B. The program PARSE is given as input a list of tokens; if it halts error-free then that
string must be the linear representation of a grammatical expression tree. Notice that our
work is simplified by the fact that we do not worry about the value returned by the
program; this leads us to adopt the following convention.

Notation: We write (...) = (...) to mean that the LISP expression on the left
evaluates error-free to the value on the right. The presence of LISP expressions whose
value need not be discussed will be indicated by (...).

We can now restate our theorem in terms of the program PARSE as follows.

PARSE Theorem II: If S ∈ Σ* and if (PARSE -1 (S ⊣)) = ((...) ⊣), then S ∈ S_D.

Outline of Proof 

The statement and proof of this theorem closely parallel those of the first PARSE 
theorem. As before, our desired result is a corollary of the last in a series of nine subsidiary 
theorems, which correspond (in this case precisely) to the subroutines of the program PARSE. 
These theorems, however, are now in the converse form: whenever the subroutine returns a 
value certain properties are shown to be true about the input string. The proof is again by 
simultaneous induction with the theorems proven sequentially at each level. Their 
interdependence, including the inductive use of Theorem 2.9, is illustrated in Figure 3. The 
essential difference between the two PARSE theorems is the domain of induction; in this case 
we use induction on the length of strings in the set Σ*. 

[Figure 3: dependency diagram with nodes 2.9 PARSE, 2.4 NUL-TYPE, 2.7 LEF-TYPE, 2.3 PREFIX, 2.2 NILFIX, 2.5 POSTFIX, 2.6 INFIX, and 2.9 PARSE (Induction)] 

Figure 3. Interdependence of Theorems 2.1-2.9 

Statement of Theorems 2.1 through 2.9 

Since we are given that δ ∈ Σ*, we will assume that the input list to PARSE is the list 
of tokens (t_1 ... t_s), for s ≥ 1, with the convention that t_s = ⊣. As in the proof of PARSE 
Theorem I, we use an inductive generalization, Theorem 2.9, which makes use of 
Conditions C1 through C5. In this case, however, C1 through C5 are the consequents of the 
theorem. From C1 we have the desired result that t_1 ... t_k ∈ S_D. 

Conditions: 

C1. e ∈ E'_D and δ = α op β ω = U_D(e) = t_1 ... t_k 

C1a. α = U_D(e_left) if e_left exists (λ otherwise), β = U_D(e_0) if e_0 exists (λ otherwise), 
and ω = d_1γ_1 ... d_nγ_n for n ≥ 0, where γ_i = U_D(e_i) for 1 ≤ i ≤ n when e_i is non-null 
(λ otherwise), and e_left, e_0, e_1, ... e_n ∈ E'_D when they exist and are non-null. 

C1b. r-index[e_left] ≥ lbp[op] if e_left exists. 

C1c. rbp[op] < l-index[e_0] if e_0 exists. 

C1d. rbp[op] < l-index[e_i] for 1 ≤ i ≤ n, when e_i is non-null. 

C1e. d_1 ∉ c-set[e_0] if e_0 and d_1 exist. 

C1f. d_i ∉ c-set[e_{i-1}] for 1 < i ≤ n when e_{i-1} is non-null. 

C2. r-index[e] ≥ lbp[t_{k+1}]. 

C2a. rbp[op] ≥ lbp[t_{k+1}] if e_n exists and is non-null. 
C2b. r-index[e_n] ≥ lbp[t_{k+1}] if e_n exists and is non-null. 

C3. t_{k+1} ∉ c-set[e]. 

C3a. t_{k+1} ∉ cont_{p[op]}(e_1, ... e_n). 
C3b. t_{k+1} ∉ c-set[e_n] if e_n exists. 

C4. rbp < l-index[e]. 

C5. rbp ≥ lbp[t_{k+1}]. 

We state now the theorems in full. 

Theorem 2.1 (FIND): If (FIND rbp[op] (nil t_{j+1} ... t_s) p[op]) = state for 
1 ≤ j < s then 

(a) state = ((...) t_{k+1} ... t_s) where j ≤ k < s 

(b) t_{j+1} ... t_k = ω = d_1γ_1 ... d_nγ_n for n ≥ 0 where γ_i = U_D(e_i) and e_i ∈ E'_D for 1 ≤ i ≤ n, 
and ω matches p[op]. 

(c) C1d, C1f, C2a, C2b, C3a, and C3b hold for t_{j+1} ... t_k. 

Theorem 2.2 (NILFIX): If t_1 = op is defined NILFIX and 
(NILFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Theorem 2.3 (PREFIX): If t_1 = op is defined PREFIX and 
(PREFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Theorem 2.4 (NUL-TYPE): If for 1 < s (NUL-TYP (t_1 ... t_s)) = state then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k, where t_1 is defined NILFIX or PREFIX. 

Theorem 2.5 (POSTFIX): If t_{a+1} = op is defined POSTFIX, t_1 ... t_a = U_D(e_left) for 
e_left ∈ E'_D, r-index[e_left] ≥ lbp[t_{a+1}], and 
(POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then 

(a) state = ((...) t_{k+1} ... t_s) where a < k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Theorem 2.6 (INFIX): If t_{a+1} = op is defined INFIX, t_1 ... t_a = U_D(e_left) for 
e_left ∈ E'_D, r-index[e_left] ≥ lbp[t_{a+1}], and 
(INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then 

(a) state = ((...) t_{k+1} ... t_s) where a < k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Theorem 2.7 (LEF-TYPE): If t_1 ... t_a = U_D(e_left) for e_left ∈ E'_D, 
r-index[e_left] ≥ lbp[t_{a+1}], and (LEF-TYPE ((...) t_{a+1} ... t_s)) = state where a < s 
then 

(a) state = ((...) t_{k+1} ... t_s) where a < k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k where t_{a+1} is defined POSTFIX or INFIX. 

Theorem 2.8 (ASSOC): If t_1 ... t_j satisfy C1, C2, C3, and C4, and 
(ASSOC rbp ((...) t_{j+1} ... t_s)) = state where j < s then 

(1) If rbp ≥ lbp[t_{j+1}], state = ((...) t_{j+1} ... t_s). 

(2) If rbp < lbp[t_{j+1}], then 

(a) state = (ASSOC rbp ((...) t_{k+1} ... t_s)) where j < k < s 

(b) C1, C2, C3, and C4 hold for t_1 ... t_k where t_{j+1} is defined POSTFIX or INFIX. 

Theorem 2.9 (PARSE): If (PARSE rbp (t_1 ... t_s)) = state for 1 < s then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1 through C5 hold for t_1 ... t_k. 

Since this proof is essentially concerned with error handling, we precede the basis 
step with a preliminary lemma about the behavior of the PARSE program on trivially 
invalid inputs. The theorem itself deals with list arguments to PARSE of length two or 
more, and it is important to know that no value will be returned for shorter lists. 

Lemma 2.1: (PARSE rbp (t_1 ... t_s)) returns an error if s < 2. 

Proof: By the program (PARSE rbp (t_1 ... t_s)) 

= (ASSOC rbp (NUL-TYPE (t_1 ... t_s))), but NUL-TYPE immediately tests by evaluating 
(CDDR (t_1 ... t_s)). If s < 2 this will cause an error. ∎ 
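
The length check behind Lemma 2.1 can be sketched in Python. This is an illustration only, not the LISP program of this paper: the function names mirror PARSE, ASSOC, and NUL-TYPE, but the bodies, the `ParseError` class, and the `END` marker string are assumptions, and every leading token is treated as a NILFIX leaf.

```python
class ParseError(Exception):
    """Raised where the LISP program would halt with an error (assumed name)."""

END = "<end>"  # hypothetical stand-in for the end-of-input token

def nul_type(tokens):
    # Plays the role of the (CDDR ...) test: fewer than two tokens is an error.
    if len(tokens) < 2:
        raise ParseError("input list shorter than two tokens")
    head, rest = tokens[0], tokens[1:]
    # Simplification: every head token becomes a leaf, as if it were NILFIX.
    return [("leaf", head)] + rest

def assoc(rbp, state):
    # Placeholder: left-type operators are omitted in this sketch.
    return state

def parse(rbp, tokens):
    return assoc(rbp, nul_type(tokens))
```

Under these assumptions, `parse(-1, ["x", END])` returns a state whose first element is the leaf for `x` and whose remainder is the end marker, while `parse(-1, [END])` raises `ParseError`, which is the behavior the lemma relies on.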

Proof of Theorems 2.1 through 2.9, Basis Step 

For the basis step we assume that s = 2, so the input string is (t_1 t_2) = (t_1 ⊣). 
Since Theorem 2.9 is the final result and is the only theorem to be used inductively, it is the 
only essential part of the basis step proof. To prove Theorem 2.9 for the case s = 2 we will 
also need Theorems 2.1, 2.2, and 2.4. 

Theorem 2.1 (FIND): If (FIND rbp[op] (nil ⊣) p[op]) = state, then 

(a) state = ((...) ⊣) 

(b) λ = ω matches p[op] 

(c) C1d, C1f, C2a, C2b, C3a, and C3b hold for λ. 

Proof: Since k < s, the second argument to FIND must be (nil ⊣). We prove then the 
following assertion inductively over the definition of the pattern p[op]. If 
(FIND rbp[op] (nil ⊣) p[op]) = state, then λ matches p and state = (nil ⊣). This 
assertion implies that ω = λ and so C1d and C1f are trivially satisfied. Since lbp[⊣] = -1, 
C2 is satisfied, and C3 because ⊣ is not in the defined language. 
Case 1. If p[op] = λ then true immediately. 
Cases 2, 3. It cannot be that p[op] = "d" or "d" ~, because a value would only be returned 
if d = ⊣ and we know that ⊣ is not part of the defined language. 
Case 4. If p[op] = qr then the value must be 
(FIND rbp[op] (FIND rbp[op] (nil ⊣) q) r). By two uses of pattern induction 
we have that λ matches q and λ matches r, so λ matches p[op], and the final result is (nil ⊣). 
Case 5. If p[op] = [q | r] then, since ⊣ cannot be in either first_q or first_r, it must be that 
λ matches p[op] and (nil ⊣) is returned. 
Case 6. If p[op] = [q]* then, since ⊣ cannot be in first_q, the result (nil ⊣) is returned 
and λ matches p[op] by definition. ∎ 

Theorem 2.2 (NILFIX): If op is defined NILFIX and 
(NILFIX op (⊣) rbp[op] p[op]) = state then 

(a) state = ((...) ⊣) 

(b) C1, C2, and C3 hold for t_1. 

Proof: By the program (NILFIX op (⊣) rbp[op] p[op]) 

= (CONS (...) (CDR (FIND rbp[op] (nil ⊣) p[op]))). By Theorem 2.1 this is 

= ((...) ⊣), and we know λ matches p[op]. Since op is defined NILFIX we have then that 
δ = t_1 = op = U_D(e) for e ∈ E'_D, completing C1. We have already C2 and C3 from 
Theorem 2.1. 

Theorem 2.4 (NUL-TYPE): If (NUL-TYP (t_1 ⊣)) = state then 

(a) state = ((...) ⊣) 

(b) C1, C2, and C3 hold for t_1 where t_1 is defined NILFIX or PREFIX. 

Proof: NUL-TYPE only returns a value in three cases. 
Case 1. If t_1 = op is defined NILFIX then we have the value 

(NILFIX op (⊣) rbp[op] p[op]) and we are done immediately by Theorem 2.2. 
Case 2. If t_1 = op is defined PREFIX then we have the value 

(PREFIX op (⊣) rbp[op] p[op]). But PREFIX evaluates the expression 
(PARSE rbp[op] (⊣)) which by Lemma 2.1 causes an error. 

Case 3. If t_1 = op is not defined, then it is assumed by default to be a variable or constant; 
we have then the value (NILFIX op (⊣) 0 λ) and we are again done by Theorem 2.2. ∎ 

Theorem 2.9 (PARSE): If (PARSE rbp (t_1 ⊣)) = state then 

(a) state = ((...) ⊣) 

(b) C1 through C5 hold for t_1. 

Proof: By the program (PARSE rbp (t_1 ⊣)) = (ASSOC rbp (NUL-TYP (t_1 ⊣))). By 
Theorem 2.4 we know that this is = (ASSOC rbp ((...) ⊣)) and that C1, C2, and C3 
hold. Since we know that lbp[⊣] = -1, ASSOC returns the value ((...) ⊣), and C5 is 
satisfied. Finally, since op has no left argument we have l-index[e] = ∞, satisfying 
C4. ∎ 

Proof of Theorems 2.1 through 2.9, Induction Step 

We now assume that s>2 and that Theorem 2.9 holds for strings of length less 
than s. 

Theorem 2.1 (FIND): If (FIND rbp[op] (nil t_{j+1} ... t_s) p[op]) = state for 
1 ≤ j < s then 

(a) state = ((...) t_{k+1} ... t_s) where j ≤ k < s 

(b) t_{j+1} ... t_k = ω = d_1γ_1 ... d_nγ_n for n ≥ 0 where γ_i = U_D(e_i) and e_i ∈ E'_D for 1 ≤ i ≤ n, 
and ω matches p[op]. 

(c) C1d, C1f, C2a, C2b, C3a, and C3b hold for t_{j+1} ... t_k. 

Proof: The proof is by induction over the definition of the pattern p[op]; the six possible 
cases are handled separately by the six conditional clauses in the program. 
Case 1. If p[op] = λ then (FIND rbp[op] (nil t_{j+1} ... t_s) p[op]) 

= ((...) t_{j+1} ... t_s). In this case ω = λ, which clearly matches p[op]. Only condition C3a 
is relevant to this case, but since p = λ, we have cont_{p[op]} = ∅. 

Case 2. If p[op] = "d" then the program will only return a value if d = t_{j+1}. If it does, the 
value is 

= ((...) t_{j+2} ... t_s). It must be the case that j+1 < s, since j+1 = s would imply that 
t_{j+1} = ⊣, which we know cannot match any delimiter in the defined language. Clearly 
ω = t_{j+1} matches p[op], and since there is no e_n the only relevant condition to satisfy is 
again C3a. Since p = "d", we have cont_{p[op]} = ∅. 

Case 3. If p[op] = "d" ~, then the program will only return a value if d = t_{j+1}. If it does, 
the value is 

= ((...) (CDR (PARSE rbp[op] (t_{j+2} ... t_s)))). We must have j+2 < s, since PARSE 
returns an error otherwise by Lemma 2.1. We have then by an inductive use of Theorem 
2.9 the value 

= ((...) t_{k+1} ... t_s) where j+2 ≤ k < s. We also know that the following conditions hold 
for γ_1 = t_{j+2} ... t_k: C1' γ_1 = U_D(e_1) for e_1 ∈ E'_D, C2' r-index[e_1] ≥ lbp[t_{k+1}], 
C3' t_{k+1} ∉ c-set[e_1], C4' rbp[op] < l-index[e_1], and C5' rbp[op] ≥ lbp[t_{k+1}]. We 
now show that the necessary conditions hold for t_{j+1} ... t_k. We have first that 
t_{j+1} ... t_k = dγ_1, which clearly matches p[op]. C1d is satisfied directly by C4'. Condition 
C1f does not apply, since n = 1. C2a is satisfied by C5' and C2b by C2'. C3b is satisfied 
directly by C3' and C3a from the fact that cont_{p[op]} = ∅ in this case. 

Case 4. If p[op] = qr, then the value of the program is 

= (FIND rbp[op] (FIND rbp[op] ((...) t_{j+1} ... t_s) q) r). By pattern induction on 
the innermost expression we have the value 

= (FIND rbp[op] ((...) t_{h+1} ... t_s) r) for j ≤ h < s. We know that 
t_{j+1} ... t_h = d_1γ_1 ... d_mγ_m = ω_1, for 0 ≤ m, which matches q, and that all conditions (call them 
C1d', C1f', etc.) hold for ω_1. By another use of pattern induction we have the value 
= ((...) t_{k+1} ... t_s) for h ≤ k < s. We know that t_{h+1} ... t_k = d_{m+1}γ_{m+1} ... d_nγ_n = ω_2, for m ≤ n, 
which matches r, and that all conditions (call them C1d", C1f", etc.) hold for ω_2. We now 
show that all conditions hold for t_{j+1} ... t_k. Clearly j ≤ k < s and t_{j+1} ... t_k = ω_1ω_2 matches 
p[op]. C1d follows directly from C1d' and C1d". C1f follows from C1f' and C1f" with 
one exception. We must show that d_{m+1} ∉ c-set[e_m]; i.e., that the first delimiter of ω_2 is 
not in the c-set of the last argument of ω_1. This case is covered by C3b'. If ω_2 ≠ λ then 
C2a follows directly from C2a", otherwise from C2a'. Similarly C2b follows from either 
C2b" or C2b'. We know from C3a" that t_{k+1} ∉ cont_r(e_{m+1}, ... e_n). If ω_2 ≠ λ then by 
Lemma 8a we know t_{k+1} ∉ cont_{p[op]}(e_1, ... e_n). If ω_2 = λ then t_{k+1} = t_{h+1}, and we also 
know that t_{h+1} ∉ cont_q(e_1, ... e_m). But by Lemma 8b it is also true that 
t_{k+1} ∉ cont_{p[op]}(e_1, ... e_n), satisfying C3a. Finally, C3b follows from C3b" when ω_2 ≠ λ, 
otherwise from C3b'. 
Case 5. If p[op] = [q | r] then the program only returns a value in one of three cases. If 
the first test is true, t_{j+1} ∈ first_q, then we have 

= (FIND rbp[op] ((...) t_{j+1} ... t_s) q). By pattern induction this is 

= ((...) t_{k+1} ... t_s), where all conditions except C3a are satisfied immediately. We know 
from C3a' that t_{k+1} ∉ cont_q(e_1, ... e_n). By Lemma 9b, since ω ≠ λ in this case, we also 
know t_{k+1} ∉ cont_{p[op]}(e_1, ... e_n), satisfying C3a. If the first test is false and the second 
test, t_{j+1} ∈ first_r, is true then we have the same situation. If the first two tests are false, 
and the third is true, λ matches p[op], then the result is 

= ((...) t_{j+1} ... t_s), where ω = λ. In this case the only relevant condition is C3a. Since 
we know by the failure of the first two tests that t_{j+1} ∉ first_q and t_{j+1} ∉ first_r, we know by 
Lemma 8 that t_{j+1} ∉ cont_{p[op]}(λ). 

Case 6. If p[op] = [q]* then the value of the program is a conditional whose test is t_{j+1} ∈ first_q. 
If the test fails then the value is 

= ((...) t_{j+1} ... t_s). If the test succeeds, then the value is a recursive call to FIND for 
p[op], after another ω_i has been found to match q. We prove by induction on the 
number of calls to FIND for p[op] made before returning. The hypothesis is that each 
time there is a call of the form (FIND rbp[op] ((...) t_{h+1} ... t_s) p[op]), then all of 
the conditions except C3a are true of the string t_{j+1} ... t_h. Thus, when the test finally 
fails, we only need show that C3a is true to be done, but by Lemma 10 we know that if 
t_{k+1} ∉ first_q, when the test fails, and if t_{k+1} ∉ c-set[e_n], which we know from the 
induction hypothesis C3b', then t_{k+1} ∉ cont_{p[op]}(e_1, ... e_n), satisfying C3a. We now 
prove the hypothesis. 

Basis: At the first call, we have j = h, or ω = λ. Since p[op] = [q]*, we know that ω matches p[op]. 
No other conditions are relevant to this case. 

Induction: If all conditions except C3a are true of t_{j+1} ... t_h = ω_1 ... ω_{i-1} (call these conditions 
C1d', C1f', etc.), and if the test t_{h+1} ∈ first_q is true, then we have 

(FIND rbp[op] ((...) t_{h+1} ... t_s) p[op]) 

= (FIND rbp[op] (FIND rbp[op] ((...) t_{h+1} ... t_s) q) p[op]). We know by our 
induction over patterns (the higher level induction in this theorem) that this is 

= (FIND rbp[op] ((...) t_{k+1} ... t_s) p[op]), where the string t_{h+1} ... t_k = ω_i matches q. 
We also know that all the conditions are true for this string (call these C1d", C1f", etc.). 
We now show that all conditions are true for the whole string t_{j+1} ... t_k. Since 
ω_1 ... ω_{i-1} matches p and ω_i matches q, we know that t_{j+1} ... t_k = ω_1 ... ω_i matches p. C1d is satisfied directly by 
C1d' and C1d". C1f is similarly satisfied by C1f' and C1f" with one exception. We need 
to show that the first symbol of ω_i is not in the c-set of the last argument in ω_1 ... ω_{i-1}, 
but this follows from C3b'. We recall that Lemma 8 says that λ cannot match q. Then 
we have conditions C2a, C2b, and C3b following directly from C2a", C2b", and C3b" 
respectively. Thus all conditions except C3a are satisfied. ∎ 
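
The six-way case analysis above can be mirrored in a small Python sketch of a pattern matcher. It is a reading aid under stated assumptions, not the paper's FIND: patterns are encoded as hypothetical tuples (`("empty",)`, `("delim", d)`, `("delim_arg", d)` for "d" ~, `("seq", q, r)`, `("alt", q, r)`, `("star", q)`), the `parse_arg` callback stands in for the recursive call to PARSE, and the binding-power argument is dropped.

```python
END = "<end>"  # hypothetical end-of-input marker; never a delimiter

def nullable(p):
    """True if pattern p can match the empty string."""
    kind = p[0]
    if kind in ("empty", "star"):
        return True
    if kind in ("delim", "delim_arg"):
        return False
    if kind == "seq":
        return nullable(p[1]) and nullable(p[2])
    if kind == "alt":
        return nullable(p[1]) or nullable(p[2])
    raise ValueError(f"unknown pattern kind {kind!r}")

def first(p):
    """Set of delimiters that can begin a non-empty match of p."""
    kind = p[0]
    if kind == "empty":
        return set()
    if kind in ("delim", "delim_arg"):
        return {p[1]}
    if kind == "seq":
        return first(p[1]) | (first(p[2]) if nullable(p[1]) else set())
    if kind == "alt":
        return first(p[1]) | first(p[2])
    if kind == "star":
        return first(p[1])
    raise ValueError(f"unknown pattern kind {kind!r}")

def find(parse_arg, found, tokens, p):
    """Match p against the front of tokens; return (found, remaining tokens)."""
    kind = p[0]
    if kind == "empty":                          # Case 1
        return found, tokens
    if kind in ("delim", "delim_arg"):           # Cases 2 and 3
        if tokens[0] != p[1]:
            raise ValueError(f"expected {p[1]!r}, got {tokens[0]!r}")
        if kind == "delim":
            return found + [p[1]], tokens[1:]
        arg, rest = parse_arg(tokens[1:])
        return found + [p[1], arg], rest
    if kind == "seq":                            # Case 4
        found, tokens = find(parse_arg, found, tokens, p[1])
        return find(parse_arg, found, tokens, p[2])
    if kind == "alt":                            # Case 5: three tests in order
        for branch in (p[1], p[2]):
            if tokens[0] in first(branch):
                return find(parse_arg, found, tokens, branch)
        if nullable(p):
            return found, tokens
        raise ValueError(f"no alternative starts with {tokens[0]!r}")
    if kind == "star":                           # Case 6: loop replaces recursion
        while tokens[0] in first(p[1]):
            found, tokens = find(parse_arg, found, tokens, p[1])
        return found, tokens
    raise ValueError(f"unknown pattern kind {kind!r}")

def one_token(tokens):
    """Toy argument parser (assumption): an argument is a single token."""
    if tokens[0] == END:
        raise ValueError("argument expected before end of input")
    return tokens[0], tokens[1:]
```

For example, with `p = ("seq", ("delim_arg", "then"), ("alt", ("delim_arg", "else"), ("empty",)))`, the call `find(one_token, [], ["then", "x", "else", "y", END], p)` consumes both clauses and leaves only the end marker, while the same pattern on `["then", "x", END]` takes the empty alternative. The loop in the `star` case relies on the same fact the proof uses: each successful test on `first` leads to at least one token being consumed.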

Theorem 2.2 (NILFIX): If t_1 = op is defined NILFIX and 
(NILFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Proof: By the program (NILFIX op (t_2 ... t_s) rbp[op] p[op]) 

= (CONS (...) (CDR (FIND rbp[op] (nil t_2 ... t_s) p[op]))). So by Theorem 2.1 

= ((...) t_{k+1} ... t_s) where 1 ≤ k < s. Since the annotation part t_2 ... t_k matches p[op] by 
the theorem and C1d and C1f hold, we have satisfied C1, because C1b, C1c, and C1e are 
not relevant to the NILFIX case. Then t_1 ... t_k = U_D(e) for e ∈ E'_D. By Theorem 2.1 we 
also have C2a, C2b, C3a, and C3b, which give us C2 and C3 for e. ∎ 

Theorem 2.3 (PREFIX): If t_1 = op is defined PREFIX and 
(PREFIX op (t_2 ... t_s) rbp[op] p[op]) = state, for 1 < s, then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Proof: By the program (PREFIX op (t_2 ... t_s) rbp[op] p[op]) 
= (CONS (...) (CDR (FIND rbp[op] 
(CONS NIL (CDR (PARSE rbp[op] (t_2 ... t_s)))) 
p[op]))). 
We first consider the expression (PARSE rbp[op] (t_2 ... t_s)). It must be the case that 
s > 2, otherwise PARSE causes an error by Lemma 2.1. We have then by the inductive use 
of Theorem 2.9 that the result is 
= (CONS (...) (CDR (FIND rbp[op] (nil t_{j+1} ... t_s) p[op]))), where we know the 
following about t_2 ... t_j = β: C1' β = U_D(e_0) for e_0 ∈ E'_D, C2' r-index[e_0] ≥ lbp[t_{j+1}], 
C3' t_{j+1} ∉ c-set[e_0], C4' rbp[op] < l-index[e_0], and C5' rbp[op] ≥ lbp[t_{j+1}]. 
Finally, we know that j < s, so we apply Theorem 2.1 and get 

= ((...) t_{k+1} ... t_s), where we know k < s and that C1d, C1f, C2a, C2b, C3a, and C3b 
already hold for the expression t_{j+1} ... t_k. We satisfy the others as follows. We have now 
t_1 ... t_k = op β ω where op is defined PREFIX, and annotation part ω matches p[op], 
satisfying C1a. C1b is not relevant to this case. C1c is satisfied by C4'. C1e is only 
relevant if ω ≠ λ, in which case it is satisfied by C3'. If ω ≠ λ then e_n is part of the 
annotation ω and conditions C2 and C3 follow from C2a, C2b, C3a, and C3b obtained 
from Theorem 2.1. If ω = λ, then e_n = e_0. In this case C2 is satisfied by C5', and C3 is 
satisfied by C3a and C3'. ∎ 

Theorem 2.4 (NUL-TYPE): If for 1 < s (NUL-TYP (t_1 ... t_s)) = state then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k where t_1 is defined NILFIX or PREFIX. 

Proof: NUL-TYPE returns a value in three possible cases. 
Case 1. If t_1 = op is defined NILFIX then we have the value 

= (NILFIX op (t_2 ... t_s) rbp[op] p[op]) and the result is immediate by Theorem 2.2. 
Case 2. If t_1 = op is defined PREFIX then we have the value 

= (PREFIX op (t_2 ... t_s) rbp[op] p[op]) and the result is immediate by Theorem 2.3. 
Case 3. If t_1 is undefined, then it is assumed by default to be NILFIX with rbp = 0 and 
p[op] = λ. We have the value 

= (NILFIX op (t_2 ... t_s) 0 λ) and the result is immediate by Theorem 2.2. ∎ 

Theorem 2.5 (POSTFIX): If t_{a+1} = op is defined POSTFIX, t_1 ... t_a = U_D(e_left) for 
e_left ∈ E'_D, r-index[e_left] ≥ lbp[t_{a+1}], and 
(POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then 

(a) state = ((...) t_{k+1} ... t_s) where a < k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Proof: By the program we have (POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) 

= (CONS (...) (CDR (FIND rbp[op] (nil t_{a+2} ... t_s) p[op]))). So by Theorem 2.1 

= ((...) t_{k+1} ... t_s) for a+1 ≤ k < s, and ω = t_{a+2} ... t_k matches p[op] with C1d, C1f, C2a, 
C2b, C3a, and C3b already true. Since t_1 ... t_a = α = U_D(e_left) by assumption we have 
t_1 ... t_k = α op ω, satisfying C1a. C1b is satisfied by the given, and C1c and C1e are not 
relevant to the POSTFIX case, so we have t_1 ... t_k = U_D(e) for e ∈ E'_D, satisfying C1. C2 
and C3 now follow directly from C2a, C2b, C3a, and C3b. ∎ 

Theorem 2.6 (INFIX): If t_{a+1} = op is defined INFIX, t_1 ... t_a = U_D(e_left) for 
e_left ∈ E'_D, r-index[e_left] ≥ lbp[t_{a+1}], and 
(INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) = state where a+1 < s then 

(a) state = ((...) t_{k+1} ... t_s) where a < k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k. 

Proof: By the program we have (INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) 

= (CONS (...) (CDR (FIND rbp[op] 
(CONS NIL (CDR (PARSE rbp[op] (t_{a+2} ... t_s)))) 
p[op]))). 
Using the same argument as in Theorem 2.3, substituting (t_{a+2} ... t_s) for (t_2 ... t_s), we 
apply Theorems 2.9 and 2.1 in order to get 

= ((...) t_{k+1} ... t_s). Conditions C1-C3 are also satisfied for the same reasons as in 
Theorem 2.3, with the exception of C1b which is no longer irrelevant but is satisfied by 
the given. ∎ 

Theorem 2.7 (LEF-TYPE): If t_1 ... t_a = U_D(e_left) for e_left ∈ E'_D, 
r-index[e_left] ≥ lbp[t_{a+1}], and (LEF-TYPE ((...) t_{a+1} ... t_s)) = state where a < s 
then 

(a) state = ((...) t_{k+1} ... t_s) where a < k < s 

(b) C1, C2, and C3 hold for t_1 ... t_k where t_{a+1} is defined POSTFIX or INFIX. 

Proof: It must be the case that a+1 < s, otherwise LEF-TYPE returns an error by checking 
(CDDR (t_{a+1} ... t_s)), and LEF-TYPE only returns a value in the following two cases. 
Case 1. If t_{a+1} = op is defined POSTFIX then we have the value 

= (POSTFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) and the result is immediate by 
Theorem 2.5. 

Case 2. If t_{a+1} = op is defined INFIX then we have the value 

= (INFIX (...) op (t_{a+2} ... t_s) rbp[op] p[op]) and the result is immediate by 
Theorem 2.6. ∎ 

Theorem 2.8 (ASSOC): If t_1 ... t_j satisfy C1, C2, C3, and C4, and 
(ASSOC rbp ((...) t_{j+1} ... t_s)) = state where j < s then 

(1) If rbp ≥ lbp[t_{j+1}], state = ((...) t_{j+1} ... t_s). 

(2) If rbp < lbp[t_{j+1}], then 

(a) state = (ASSOC rbp ((...) t_{k+1} ... t_s)) where j < k < s 

(b) C1, C2, C3, and C4 hold for t_1 ... t_k where t_{j+1} is defined POSTFIX or INFIX. 

Proof: The program is a conditional which tests (LESSP rbp lbp[t_{j+1}]). If the test is 
false then we have part (1). If true we have 

= (ASSOC rbp (LEF-TYPE ((...) t_{j+1} ... t_s))). From the given we know C1', C2', C3', 
and C4' for t_1 ... t_j. By C1' and C2' the conditions for Theorem 2.7 are satisfied so we 
have 

= (ASSOC rbp ((...) t_{k+1} ... t_s)) where j < k < s. We know further that C1, C2, and C3 
hold for t_1 ... t_k where t_{j+1} is defined POSTFIX or INFIX. C4 holds since we have 
rbp < lbp[t_{j+1}] by assumption and C4'. ∎ 
Theorem 2.9 (PARSE): If (PARSE rbp (t_1 ... t_s)) = state for 1 < s then 

(a) state = ((...) t_{k+1} ... t_s) where 1 ≤ k < s 

(b) C1 through C5 hold for t_1 ... t_k. 

Proof: By the program (PARSE rbp (t_1 ... t_s)) 

= (ASSOC rbp (NUL-TYP (t_1 ... t_s))). By Theorem 2.4 we know that this is 

= (ASSOC rbp ((...) t_{j+1} ... t_s)) where 1 ≤ j < s, and C1, C2, and C3 hold for t_1 ... t_j. 
Since t_1 is defined NILFIX or PREFIX we know by definition that l-index[e] = ∞, so 
C4 is also satisfied. We know by Theorem 2.8 that ASSOC either returns when 
rbp ≥ lbp[t_{j+1}] or calls itself recursively with conditions C1, C2, C3, and C4 still 
satisfied. By induction, when ASSOC does halt, C1, C2, C3, and C4 still hold. In 
addition condition C5 is satisfied by the failure of the test. Clearly ASSOC must 
eventually halt, since at each call we know j < k < s; i.e., every call removes more symbols 
from the input stream. ∎ 
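
The roles of PARSE, NUL-TYPE, LEF-TYPE, and ASSOC in this argument can be illustrated with a minimal Python sketch of a binding-power parser. It is an assumption-laden simplification rather than the paper's program: only INFIX operators without annotation patterns are handled, the `LBP` and `RBP` tables and the `END` marker are hypothetical, undefined tokens are given a left binding power of 0, and the recursive ASSOC is written as a loop.

```python
END = "<end>"                        # hypothetical end marker, with lbp = -1
LBP = {"+": 10, "*": 20, END: -1}    # assumed left binding powers
RBP = {"+": 10, "*": 20}             # assumed right binding powers

def nul_type(tokens):
    # At least one token must precede the end marker (compare Lemma 2.1).
    if len(tokens) < 2:
        raise ValueError("input list shorter than two tokens")
    if tokens[0] in LBP:
        raise ValueError(f"unexpected {tokens[0]!r} at start of expression")
    return tokens[0], tokens[1:]     # undefined tokens act as NILFIX leaves

def lef_type(left, tokens):
    op = tokens[0]
    if op not in RBP:
        raise ValueError(f"{op!r} is not defined INFIX")
    right, rest = parse(RBP[op], tokens[1:])
    return (op, left, right), rest

def assoc(rbp, tree, tokens):
    # Continue while rbp < lbp of the next token; each pass consumes tokens,
    # and on exit rbp >= lbp of the next token, which is condition C5.
    while rbp < LBP.get(tokens[0], 0):
        tree, tokens = lef_type(tree, tokens)
    return tree, tokens

def parse(rbp, tokens):
    tree, rest = nul_type(tokens)
    return assoc(rbp, tree, rest)
```

With these tables, `parse(-1, ["a", "+", "b", "*", "c", END])` groups the product under the sum and stops at the end marker, since no right binding power used here is below -1. Equal left and right binding powers make `+` associate to the left in this sketch.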
VI. CONCLUSIONS 



VI.A Summary 

We began with the observation that BNF is not effective as a practical 
metalanguage for programming language designers, implementers, and users. We used 
Pratt's CGOL technique for translator construction, and specified a meta-language which 
avoids many of the difficulties inherent in BNF approaches. Its essential feature is an 
expressive power which is very closely related to the actual parsing technique of the 
translator: we can conveniently describe exactly those languages which the translator 
technique handles well. An immediate consequence is freedom from the awkward 
restrictions inherent in most automatic translator construction systems. 

We have demonstrated these advantages by presenting the design of a CGOL based 
parsing program; although the meta-language is based on Pratt's informal syntactic 
guidelines, we have demonstrated with a formal correctness proof that none of the rigor of 
more traditional approaches has been sacrificed. The first part of this proof deals 
exclusively with properties of the meta-language; these results permit a very straightforward 
program proof, and may be applied equally well to proofs of other implementations. 

VI.B Further Work 

The use of nonstandard syntactic descriptions is an open area for research. The 
example presented in this paper treats a class of languages appropriate to the CGOL 
technique; it should be feasible to apply the same approach in other, perhaps more 
specialized, contexts. Even within the CGOL system there are a number of issues which 
need more thought. For example, the meta-language presented in this paper uses regular 
expressions to specify multiple right arguments. More than half of the proof is devoted to 
patterns, and the parser for them is the one long program in the system. The generality of 
regular expressions may not be worth the effort involved. Other unresolved issues deal 
with delimiters; e.g., it is not absolutely necessary that they have left binding powers of zero. 
This convention was imposed for simplicity. 

There are also a number of unfinished implementation issues. The LISP 
implementation of the parser is much longer and less efficient than necessary but could be 
immediately improved by the use of global variables and side effects. The actual parser 
should be as short as most of the definitions for CGOL given in [Pratt 1974]. In addition, 
an actual implementation of the meta-language processor is desirable. This could take the 
form of an interactive definitional facility, providing the designer with on-line assistance, 
such as production debugging, and with incremental implementation, e.g. for bootstrapping. 

SUMMARY OF NOTATION 



p, q, r are CGOL annotation patterns 

~ is a metasymbol used in productions to denote the presence of an argument 

D is a language definition (a set of productions) 

e is an expression tree 

E_D is the set of expression trees corresponding to a definition D 

E'_D is the set of grammatical expression trees corresponding to a definition D 

op is an operator 

t is a token, a lexeme 

d is a delimiting token 

α, β, γ, δ, ω are strings of tokens 

λ is the empty string 

S_D is a set of strings, over some alphabet of tokens, corresponding to a definition D 

U_D is a writing function defined on E_D with values in S_D 

P_D is the parse function corresponding to a definition D 